# NLP Project Part A: Model Performance and Comparison

For this project, you shall develop two separate natural language processing machine learning pipelines. The first shall target sentiment analysis, while the second shall target question answering. Throughout this process, you will compare different models on tasks, and learn to optimize hyperparameters effectively.

Note: We believe that this project may be more challenging than previous projects due to the number of libraries and tools involved. It is even more important to start early and iterate early here.

## Part A: Sentiment Analysis with Python
In this part, you will apply sklearn and related NLP libraries to predict user movie review sentiment on the [IMDB movie review dataset](https://ai.stanford.edu/~amaas/data/sentiment/). Before you begin, check that your installed `scikit-learn` version is as specified in `requirements.txt`; otherwise you may not pass the local tests. 

In [3]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
import sklearn

from gensim.models import Word2Vec

import pandas as pd
import numpy as np
import scipy.sparse as sp

We begin by loading a subset of the dataset, which contains 5000 movie reviews and their associated sentiment labels (i.e., whether a review is considered positive or negative).

In [4]:
df_reviews = pd.read_csv("imdb_reviews.csv")

In [5]:
# this cell has been tagged with excluded_from_script
# it will be ignored by the autograder
df_reviews.head()

Unnamed: 0,review,processed_review,sentiment
0,Taran Adarsh a reputed critic praised such a d...,taran adarsh repute critic praise dubba movie ...,negative
1,"Worth the entertainment value of a rental, esp...",worth entertainment value rental especially li...,negative
2,"I liked Antz, but loved ""A Bug's Life"". The an...",like antz love bug life animation put paid def...,positive
3,This reboot is like a processed McDonald's mea...,reboot like process mcdonald meal compare ang ...,negative
4,"The working title was: ""Don't Spank Baby"". <br...",work title spank baby wayne crawford go become...,positive


The `review` column contains raw review texts from the original dataset. However, it's always a good idea to process and clean text data before performing analysis. To reduce your workload, we have processed the text for you already. The column `processed_review` was constructed by processing and tokenizing the raw reviews, using the `preprocess_text` function from Project 3, and then joining the review tokens by a single space. From this point, you only need to focus on the `processed_review` and `sentiment` columns.

Next, let's look at the distribution of class labels:

In [6]:
# this cell has been tagged with excluded_from_script
# it will be ignored by the autograder
display(df_reviews['sentiment'].value_counts())

negative    2500
positive    2500
Name: sentiment, dtype: int64

We see that there are 2500 positive reviews and 2500 negative reviews. In other words, our dataset is perfectly balanced, and thus we can reasonably use accuracy as our metric of choice.

### Question 1: Count Vectorizer

Similar to P3, before we feed any of our data into a machine learning model, we need to convert our text into numeric feature vectors through feature engineering.

The first feature engineering task we will perform is building a term-frequency matrix. However, as you have already performed this task once manually in P3, we shall use `sklearn`, a standard machine learning library, to make things easier. Implement the function `count_vectorizer` that uses sklearn's `CountVectorizer` API to construct the term-frequency training matrix and testing matrix, along with the feature names (i.e., the list of words corresponding to the columns in the matrices). 

One point to keep in mind is that `CountVectorizer` will, by default, do its own preprocessing and tokenization (see the [documentation](https://scikit-learn.org/stable/modules/feature_extraction.html#customizing-the-vectorizer-classes) for more details). As these steps have already been performed, we need to override sklearn's default behavior by specifying that `analyzer` should be `str.split`.
 
Additionally, be careful with how you transform the test data. Unlike P3, where we ignored this step for simplicity, we want to fit our transformation tool **only on the training data**, but transform both the training and test data. This means that, if a token is in the test data but not in the training data, it will be ignored.

Notes:
* If you are using an earlier version of scikit learn while testing, you will need to use **get_feature_names()** instead of **get_feature_names_out()**.
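To illustrate the fit-on-train, transform-both pattern described above, here is a minimal sketch on made-up toy sentences (the documents are hypothetical and not taken from the IMDB dataset). The token `great` appears only in the test document, so it has no column in the learned vocabulary and is silently dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy, already-tokenized documents (hypothetical data for illustration only).
train_docs = ["good movie good plot", "bad movie"]
test_docs = ["great movie bad plot"]

# analyzer=str.split bypasses sklearn's own preprocessing and tokenization.
vectorizer = CountVectorizer(analyzer=str.split)

# Learn the vocabulary from the training documents only.
tf_train = vectorizer.fit_transform(train_docs)
# Reuse that vocabulary for the test documents; unseen tokens are ignored.
tf_test = vectorizer.transform(test_docs)

# Columns are ordered alphabetically by token.
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)          # column names
print(tf_train.toarray())  # counts for the training documents
print(tf_test.toarray())   # counts for the test document, without "great"
```

Both matrices share the same four columns, which is what allows a model trained on `tf_train` to be evaluated on `tf_test`.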

In [7]:
def count_vectorizer(reviews_train, reviews_test = None):
    """
    Compute the term-frequency matrices for train_data and test_data using CountVectorizer.
    
    args:
        reviews_train (pd.Series[str]) : a Series of processed reviews for training
        
    kwargs:
        reviews_test (pd.Series[str]) : a Series of processed reviews for testing
    
    return:
        Tuple(tf_train, tf_test, features):
            tf_train (scipy.sparse.csr_matrix) : TF matrix for training
            tf_test (scipy.sparse.csr_matrix) : TF matrix for testing,
                or None if reviews_test is None
            features (List[str]) : the list of words corresponding to the columns in the TF matrices
    """
    vectorizer = CountVectorizer(analyzer=(lambda x: x.split()))
    # fit_transform already returns a sparse matrix, so there is no need to densify it
    X = vectorizer.fit_transform(reviews_train)
    features = vectorizer.get_feature_names_out()
    tf_train = sp.csr_matrix(X)
    
    if reviews_test is not None:
        X_test = vectorizer.transform(reviews_test)
        tf_test = sp.csr_matrix(X_test)
    else:
        tf_test = None

    return tf_train, tf_test, features

In [8]:
def test_count_vectorizer():
    reviews_train, reviews_test = train_test_split(df_reviews["processed_review"], random_state = 0)
    count_vec_train, count_vec_test, features = count_vectorizer(reviews_train, reviews_test)
    assert count_vec_train.shape == (3750, 27242)
    assert count_vec_test.shape == (1250, 27242)
    assert np.allclose(
        count_vec_train.sum(axis = 1)[:10].ravel().tolist()[0],
        [70, 65, 168, 77, 139, 132, 28, 139, 453, 89]
    )
    assert np.allclose(
        count_vec_test.sum(axis = 1)[:10].ravel().tolist()[0],
        [168, 60, 59, 144, 494, 135, 69, 119, 76, 68]
    )
    assert list(features[:10]) == ['00', '000', '00015', '007', '00pm', '00s', '01', '01pm', '02', '029']
    assert list(features[-10:]) == ['zucco', 'zucker', 'zukovic', 'zula', 'zuleika', 'zumhofe', 'zurer', 'zvezda', 'zwick', 'zylberstein']
    print("All tests passed!")
    
test_count_vectorizer()

All tests passed!


### Question 2: Bag of Words Vectorizer

Building on the previous function, let's also implement a so-called "Bag of Words" vectorizer. This vectorizer only considers the presence or absence of a token rather than its count. This makes it considerably more interpretable than the Count Vectorizer, at the cost of discarding the frequency of potentially useful terms. Implement the function `bow_vectorizer` that constructs the bag-of-words training and testing matrices, along with the feature names (i.e., the list of words corresponding to the columns in the matrices). You might want to use the solution to the previous question to handle this question.
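As a small illustration of the presence/absence idea, the sketch below builds a bag-of-words matrix in two ways on hypothetical toy documents: by clipping the counts produced by `CountVectorizer`, and by passing `binary=True` to `CountVectorizer`. The second route is shown only as a cross-check and is not necessarily the approach the assignment expects.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, already tokenized and joined by spaces.
docs = ["good good good movie", "bad movie"]

# Term-frequency counts: "good" is counted three times in the first document.
counts = CountVectorizer(analyzer=str.split).fit_transform(docs)

# Route 1: turn every nonzero count into 1.
bow_from_counts = (counts > 0).astype(int)

# Route 2: let CountVectorizer record presence/absence directly.
bow_binary = CountVectorizer(analyzer=str.split, binary=True).fit_transform(docs)

print(counts.toarray())
print(bow_from_counts.toarray())
print((bow_binary.toarray() == bow_from_counts.toarray()).all())
```

The two routes agree because both keep the same vocabulary and only differ in whether repeated tokens are counted.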

In [6]:
def bow_vectorizer(reviews_train, reviews_test = None):
    """
    Compute the bag of words matrices for train_data and test_data.
    
    args:
        reviews_train (pd.Series[str]) : a Series of processed reviews for training
    
    kwargs:
        reviews_test (pd.Series[str]) : a Series of processed reviews for testing
    
    return:
        Tuple(tf_train, tf_test, features):
            tf_train (scipy.sparse.csr_matrix) : Bag-of-Words matrix for training
            tf_test (scipy.sparse.csr_matrix) : Bag-of-Words matrix for testing,
                or None if reviews_test is None
            features (List[str]) : the list of words corresponding to the columns in the Bag-of-Words matrices
    """
    count_train, count_test, features = count_vectorizer(reviews_train, reviews_test)
    # Clip every nonzero count to 1 so each entry records presence/absence only.
    count_train[count_train.nonzero()] = 1
    if reviews_test is not None:
        count_test[count_test.nonzero()] = 1
    else:
        count_test = None
    return count_train, count_test, features

In [8]:
def test_bow_vectorizer():
    reviews_train, reviews_test = train_test_split(df_reviews["processed_review"], random_state = 0)
    bow_vec_trains, bow_vec_test, features = bow_vectorizer(reviews_train, reviews_test)
    assert bow_vec_trains.shape == (3750, 27242)
    assert bow_vec_test.shape == (1250, 27242)
    
    assert np.allclose(
        bow_vec_trains.sum(axis = 1)[:10].ravel().tolist()[0],
        [59, 64, 155, 62, 110, 110, 23, 119, 310, 64]
    )
    assert np.allclose(
        bow_vec_test.sum(axis = 1)[:10].ravel().tolist()[0],
        [113, 38, 52, 123, 258, 109, 59, 101, 74, 56]
    )
    assert list(features[:10]) == ['00', '000', '00015', '007', '00pm', '00s', '01', '01pm', '02', '029']
    assert list(features[-10:]) == ['zucco', 'zucker', 'zukovic', 'zula', 'zuleika', 'zumhofe', 'zurer', 'zvezda', 'zwick', 'zylberstein']
    print("All tests passed!")
    
test_bow_vectorizer()

All tests passed!


### Question 3: TF-IDF Vectorizer

Now let's do the same with TF-IDF, as we did in the last project. Implement the function `tfidf_vectorizer` that uses sklearn's `TfidfVectorizer` API to construct the TF-IDF training and testing matrices, along with the feature names (i.e., the list of words corresponding to the columns in the matrices). Use the same parameter value for `analyzer` as you did in Question 1.
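For intuition about what `TfidfVectorizer` computes, the sketch below reproduces its default weighting by hand on hypothetical toy documents. It assumes sklearn's default settings (smoothed IDF, L2 row normalization, no sublinear TF scaling); with other settings the manual formula would need to change.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy documents, already tokenized and joined by spaces.
docs = ["good movie good plot", "bad movie", "bad plot twist"]

# Raw term counts, one row per document.
counts = CountVectorizer(analyzer=str.split).fit_transform(docs).toarray()
n_docs = counts.shape[0]
# Document frequency: in how many documents each token appears.
doc_freq = (counts > 0).sum(axis=0)

# Smoothed IDF under sklearn's defaults: ln((1 + n) / (1 + df)) + 1.
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
weighted = counts * idf
# Scale each row to unit Euclidean (L2) length.
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sklearn_tfidf = TfidfVectorizer(analyzer=str.split).fit_transform(docs).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))
```

Because every row is normalized, the row sums checked in the local test are much smaller than the raw counts from Question 1.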

In [9]:
def tfidf_vectorizer(reviews_train, reviews_test = None):
    """
    Compute the TF-IDF matrices for train_data and test_data using TfidfVectorizer.
    
    args:
        reviews_train (pd.Series[str]) : a Series of processed reviews for training
    
    kwargs:
        reviews_test (pd.Series[str]) : a Series of processed reviews for testing
    
    return:
        Tuple(tf_train, tf_test, features):
            tf_train (scipy.sparse.csr_matrix) : TF-IDF matrix for training
            tf_test (scipy.sparse.csr_matrix) : TF-IDF matrix for testing,
                or None if reviews_test is None
            features (List[str]) : the list of words corresponding to the columns in the TF-IDF matrices
    """
    vectorizer = TfidfVectorizer(analyzer=(lambda x: x.split()))
    # fit_transform already returns a sparse matrix, so there is no need to densify it
    X = vectorizer.fit_transform(reviews_train)
    features = vectorizer.get_feature_names_out()
    tf_train = sp.csr_matrix(X)
    
    # Default to None so that the return statement works when no test data is given.
    tf_test = None
    if reviews_test is not None:
        X_test = vectorizer.transform(reviews_test)
        tf_test = sp.csr_matrix(X_test)
    
    return tf_train, tf_test, features

In [10]:
def test_tfidf_vectorizer():
    reviews_train, reviews_test = train_test_split(df_reviews["processed_review"], random_state = 0)
    tfidf_vec_trains, tfidf_vec_test, features = tfidf_vectorizer(reviews_train, reviews_test)
    assert tfidf_vec_trains.shape == (3750, 27242)
    assert tfidf_vec_test.shape == (1250, 27242)
    assert np.allclose(
        tfidf_vec_trains.sum(axis = 1)[:10].ravel().tolist()[0],
        [7.03658925089979, 7.417196035144321, 11.492434722367015, 6.965673648338525, 9.428219597939362, 9.425632229448961, 3.9722806270035345, 9.635230284023372, 11.779155501275017, 7.44670396016231]
    )
    assert np.allclose(
        tfidf_vec_test.sum(axis = 1)[:10].ravel().tolist()[0],
        [7.2233277330801196, 4.869804242110142, 6.249091468966529, 9.689812079503804, 11.89432945296538, 9.115185225757216, 6.798492438570971, 8.57464867777901, 7.954528809138947, 6.81383392701789]
    )
    assert list(features[:10]) == ['00', '000', '00015', '007', '00pm', '00s', '01', '01pm', '02', '029']
    assert list(features[-10:]) == ['zucco', 'zucker', 'zukovic', 'zula', 'zuleika', 'zumhofe', 'zurer', 'zvezda', 'zwick', 'zylberstein']
    print("All tests passed!")
    
test_tfidf_vectorizer()

All tests passed!


### Question 4: Predicting review sentiment
Let's now see which feature construction method -- TF, TF-IDF, or Bag-of-Words -- is better for predicting review sentiments in our dataset. Our choice of learning algorithm here will be a support vector machine with Gaussian kernel (this means that it uses a different hypothesis function that can also account for non-linearly separable data). You can apply this learning algorithm by creating an instance of sklearn's [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) class, with `kernel = "rbf"` and `C = 1.0`.

Implement the function `predict_review_sentiment` that takes as input the `processed_review` and `sentiment` columns of our IMDB dataset and performs the following tasks:
1. Convert the `sentiment` column to a vector `y` of 1s and -1s: `positive` corresponds to 1 and `negative` to -1.
1. Perform a [stratified k-fold split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html) of the review and sentiment vectors, based on the provided `k`. Also set `shuffle` to `True` and `random_state` to the provided `seed`.
1. For $f$ from $1 \to k$:
     * Let fold $f$ be the test set, and the remaining $k-1$ folds be the training set.
     * Convert the training and testing reviews to feature matrices `X_train` and `X_test`, using either TF, TF-IDF, or Bag-of-Words. Which method to use is based on the function parameter `method`.
     * Train the SVM model on `X_train, y_train` and evaluate its accuracy $a_f$ on `X_test, y_test`.
1. Return $a_1, a_2, \ldots, a_k$.

**Notes**:
* As a reminder, accuracy is defined as
$$\text{Acc} = \frac{1}{n} \sum_{i=1}^n \mathbb{1}(y^{(i)} = \hat y^{(i)}).$$
You can also use the `score` function from `SVC` to quickly compute accuracy on test data.
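The sketch below illustrates the two mechanics used in this question on hypothetical toy data (two 2-D clusters rather than review features): `StratifiedKFold` keeps the class balance in every fold, and the accuracy formula above gives the same value as `SVC.score`.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# Hypothetical toy data: two well-separated 2-D clusters with balanced labels.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, size=(20, 2)), rng.normal(2, 1, size=(20, 2))])
y = np.array([-1] * 20 + [1] * 20)

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
manual_accs, score_accs, positives_per_fold = [], [], []
for train_idx, test_idx in skf.split(X, y):
    model = SVC(kernel="rbf", C=1.0).fit(X[train_idx], y[train_idx])
    y_hat = model.predict(X[test_idx])
    # Accuracy as the fraction of test points whose prediction matches the label.
    manual_accs.append(float(np.mean(y_hat == y[test_idx])))
    # The same quantity computed by sklearn.
    score_accs.append(float(model.score(X[test_idx], y[test_idx])))
    # Number of positive examples in this test fold.
    positives_per_fold.append(int(np.sum(y[test_idx] == 1)))

print(positives_per_fold)
print(np.allclose(manual_accs, score_accs))
```

With 20 examples per class and 4 folds, each test fold receives exactly 5 positives and 5 negatives.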

In [8]:
def predict_review_sentiment(reviews, sentiments, method, k, seed = 0):
    """
    Compute the cross-validated accuracy of SVM with TF, TF-IDF, or Bag-of-Words
    features in predicting review sentiment.
    
    args:
        reviews (pd.Series[str]) : a Series of all processed movie reviews
        sentiments (pd.Series[str]) : a Series of movie review sentiments,
            containing either "positive" or "negative"
        method (str) : a string which is either "TF", "TF-IDF", or "Bag"
            specifying which feature construction method to use
        k (int) : the number of folds in stratified k-fold split
    
    kwargs:
        seed (int) : the random generator seed for kfold split
        
    return:
        List[float] : a list of k accuracy values from evaluating a trained SVM model
            on each of the k folds, using the remaining folds as training data
    """
    sentiment_vec = sentiments.apply(lambda x: -1 if x == "negative" else 1 if x == "positive" else None).to_list()
    # Alternative (assuming only these two labels occur):
    # np.where(sentiments == "negative", -1, 1).tolist()
    skf = StratifiedKFold(n_splits = k, shuffle = True, random_state = seed)
    accuracies = []
    for i, (train_index, test_index) in enumerate(skf.split(reviews, sentiment_vec)):
        # The split yields positional indices, so index by position with .iloc
        reviews_train = reviews.iloc[train_index]
        reviews_test = reviews.iloc[test_index]
        if method == 'TF':
            X_train, X_test, features = count_vectorizer(reviews_train, reviews_test)
        elif method == 'TF-IDF':
            X_train, X_test, features = tfidf_vectorizer(reviews_train, reviews_test)
        elif method == 'Bag':
            X_train, X_test, features = bow_vectorizer(reviews_train, reviews_test)
        else:
            raise ValueError("must be one of 'TF', 'TF-IDF', or 'Bag'")

        y_train = [sentiment_vec[idx] for idx in train_index]
        y_test = [sentiment_vec[idx] for idx in test_index]
        model = SVC(kernel='rbf', C=1.0).fit(X_train, y_train)
        acc = model.score(X_test, y_test)
        accuracies.append(acc)

    return accuracies

In [33]:
def test_predict_review_sentiment():
    # prediction based on TF
    count_vec_accs = predict_review_sentiment(df_reviews["processed_review"], df_reviews["sentiment"], "TF", 10)
    assert count_vec_accs == [0.878, 0.852, 0.85, 0.82, 0.824, 0.824, 0.82, 0.854, 0.848, 0.832], count_vec_accs
    
    # prediction based on TF-IDF
    tf_idf_accs = predict_review_sentiment(df_reviews["processed_review"], df_reviews["sentiment"], "TF-IDF", 10)
    assert tf_idf_accs == [0.88, 0.862, 0.854, 0.866, 0.848, 0.85, 0.848, 0.88, 0.868, 0.848], tf_idf_accs

    # prediction based on Bag-of-Words
    bow_accs = predict_review_sentiment(df_reviews["processed_review"], df_reviews["sentiment"], "Bag", 10)
    assert bow_accs == [0.888, 0.85, 0.852, 0.832, 0.83, 0.836, 0.842, 0.862, 0.848, 0.85], bow_accs

    print("All tests passed!")
    print("Cross-validated accuracy of SVM with TF matrices", np.mean(count_vec_accs))
    print("Cross-validated accuracy of SVM with TF-IDF matrices", np.mean(tf_idf_accs))
    print("Cross-validated accuracy of SVM with Bag-of-Words matrices", np.mean(bow_accs))
    
test_predict_review_sentiment()

All tests passed!
Cross-validated accuracy of SVM with TF matrices 0.8402
Cross-validated accuracy of SVM with TF-IDF matrices 0.8604
Cross-validated accuracy of SVM with Bag-of-Words matrices 0.849


**Note**: The above tests can take a while to run. The reference solution takes around 10 minutes on an Azure `Standard DS2 V3` compute.


We see that using TF-IDF features yields better cross-validated accuracy than using TF features or Bag-of-Words features (when the learning algorithm is SVM with RBF kernel and $C = 1.0$), although the difference in this case is not large.

### Question 5: Hyper-Parameter Optimization

With most machine learning algorithms, you will have the following:
- A variety of algorithms you want to assess on a particular dataset
- A variety of **hyper-parameters** that you want to evaluate

Unlike the **parameters** that determine the model's outputs given its inputs (what we change through training), the **hyper-parameters** are inputs to the algorithm that need to be tuned separately. In practice, this is done through **hyper-parameter optimization** techniques, which try different combinations of hyper-parameters and select the combination that performs best.

Doing this in a principled way is challenging. As we want some assurance of the performance of any model's hyper-parameters, we'd like to use cross-validation for the hyper-parameter selection as well as selecting a model. To do this in a principled way, one common approach is GridSearch with Nested Cross-Validation, where we perform cross-validation on the search procedure itself. 

Implement the function `optimize_random_forest` that takes as input the `reviews` and `sentiment` columns of our IMDB dataset and performs the following tasks:

1. Convert the `sentiment` column to a vector `y` of 1s and -1s: `positive` corresponds to 1 and `negative` to -1.
1. Perform a [stratified k-fold split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html) of the review and sentiment vectors, based on the provided `k`. Also set `shuffle` to `True` and `random_state` to the provided `seed`.
1. For $f$ from $1 \to k$:
     * Let fold $f$ be the outer test set, and the remaining $k-1$ folds be the outer training set.
     * Perform a secondary stratified n-fold "inner" split of the outer training set, based on the provided `n`. Again, set `shuffle` to `True` and `random_state` to the provided `seed`.
     * For each hyper-parameter configuration, perform n-fold cross-validation on the inner split using a RandomForestClassifier, setting `random_state` to `seed` to ensure deterministic training. Convert the training and testing reviews to feature matrices `X_train` and `X_test`, using TF, TF-IDF, or Bag-of-Words as specified by `method`.
     * Pick the parameter setting which has the highest accuracy according to the n-cross-fold validation. Train a classifier on the **outer** training set, and record the **outer** test set accuracy.
1. Return $a_1, a_2, \ldots, a_k$, along with the most accurate hyper-parameter combinations found, $c_1, c_2, \ldots, c_k$.

Note: For this question **do not use GridSearchCV**. Due to how random state is preserved, and how we are defining our state space, it is not compatible and may result in random state issues.
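The nested procedure above can be sketched on a small synthetic problem. This is only an illustration under stated assumptions: the data comes from `make_classification` instead of the review features, the featurization step is omitted because the inputs are already numeric, and the two configurations in `params` are hypothetical and chosen to run quickly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

# Hypothetical stand-in for the featurized reviews and their labels.
X, y = make_classification(n_samples=120, n_features=8, random_state=0)
# Hypothetical, deliberately small configurations.
params = [{"n_estimators": 5, "max_depth": 2}, {"n_estimators": 20, "max_depth": None}]
k, n, seed = 3, 2, 0

outer = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
results = []
for train_idx, test_idx in outer.split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]
    inner = StratifiedKFold(n_splits=n, shuffle=True, random_state=seed)

    # Inner loop: score every configuration by its mean validation accuracy.
    best_param, best_score = None, -np.inf
    for param in params:
        scores = []
        for in_tr, in_val in inner.split(X_tr, y_tr):
            clf = RandomForestClassifier(random_state=seed, **param)
            clf.fit(X_tr[in_tr], y_tr[in_tr])
            scores.append(clf.score(X_tr[in_val], y_tr[in_val]))
        if np.mean(scores) > best_score:
            best_param, best_score = param, float(np.mean(scores))

    # Outer step: retrain the winner on the whole outer training set,
    # then evaluate it once on the held-out outer test fold.
    final = RandomForestClassifier(random_state=seed, **best_param).fit(X_tr, y_tr)
    results.append((float(final.score(X[test_idx], y[test_idx])), best_param))

for acc, param in results:
    print(acc, param)
```

The outer test fold is never seen during configuration selection, which is what keeps the reported accuracies from being biased by the search.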

In [9]:
def optimize_random_forest(reviews, sentiments, params, method, k, n, seed = 0):
    """
    Compute the nested cross-fold accuracies and configurations found for a random forest model on the reviews and sentiments.
    
    args:
        reviews (pd.Series[str]) : a Series of all processed movie reviews
        sentiments (pd.Series[str]) : a Series of movie review sentiments,
            containing either "positive" or "negative"
        params (list[dict[str, Any]]) : a list of parameter configurations to try. The keys of each dict are the exact same as the keyword arguments of RandomForestClassifier.
        method (str) : a string which is either "TF","TF-IDF", or "Bag"
            specifying which feature construction method to use
        k (int) : the number of folds in the outer stratified k-fold split
        n (int) : the number of folds in the inner stratified n-fold split
    
    kwargs:
        seed (int) : the random generator seed for kfold split
        
    return:
        List[Tuple(float, dict[str, Any])] : a list of k accuracy values and configurations from evaluating a trained random forest model for each outer fold.
    """
    # DO NOT REMOVE THE FOLLOWING LINE FOR GRADING
    np.random.seed(seed)
    import random
    random.seed(1)


    # Create the labels array and StratifiedKFold Instance
    sentiment_vec = sentiments.apply(lambda x: -1 if x == "negative" else 1 if x == "positive" else None).to_list()
    skf = StratifiedKFold(n_splits = k, shuffle = True, random_state = seed)

    # Loop through outer folds, and index accordingly.
    outer_accuracy_combinations = []
    for i, (train_index, test_index) in enumerate(skf.split(reviews, sentiment_vec)):
        # The split yields positional indices, so index by position with .iloc
        outer_reviews_train = reviews.iloc[train_index].copy().reset_index(drop=True)
        outer_reviews_test = reviews.iloc[test_index].copy().reset_index(drop=True)

        outer_y_train = [sentiment_vec[idx] for idx in train_index]
        outer_y_test = [sentiment_vec[idx] for idx in test_index]

        if method == 'TF':
            X_train_outer, X_test_outer, features = count_vectorizer(outer_reviews_train, outer_reviews_test)
        elif method == 'TF-IDF':
            X_train_outer, X_test_outer, features = tfidf_vectorizer(outer_reviews_train, outer_reviews_test)
        elif method == 'Bag':
            X_train_outer, X_test_outer, features = bow_vectorizer(outer_reviews_train, outer_reviews_test)
        else:
            raise ValueError("must be one of 'TF', 'TF-IDF', or 'Bag'")
    
        # Create second stratified K fold instance.
        skf_inner = StratifiedKFold(n_splits = n, shuffle = True, random_state = seed)

        # For each hyper-parameter configuration, run the inner cross-validation
        # and keep the configuration with the highest mean accuracy.
        best_param = None
        best_accuracy = None
        for param in params:
            accuracies = []
            for j, (inner_train_index, inner_test_index) in enumerate(skf_inner.split(outer_reviews_train, outer_y_train)):
                inner_reviews_train = outer_reviews_train.iloc[inner_train_index]
                inner_reviews_test = outer_reviews_train.iloc[inner_test_index]
    
                inner_y_train = [outer_y_train[idx] for idx in inner_train_index]
                inner_y_test = [outer_y_train[idx] for idx in inner_test_index]
                
                if method == 'TF':
                    X_train_inner, X_test_inner, features = count_vectorizer(inner_reviews_train, inner_reviews_test)
                elif method == 'TF-IDF':
                    X_train_inner, X_test_inner, features = tfidf_vectorizer(inner_reviews_train, inner_reviews_test)
                elif method == 'Bag':
                    X_train_inner, X_test_inner, features = bow_vectorizer(inner_reviews_train, inner_reviews_test)
                else:
                    raise ValueError("must be one of 'TF', 'TF-IDF', or 'Bag'")
    
                inner_model = RandomForestClassifier(random_state=seed, **param)
                inner_model.fit(X_train_inner, inner_y_train)
                inner_acc = inner_model.score(X_test_inner, inner_y_test)
                accuracies.append(inner_acc)
            # Average the inner-fold accuracies and keep the best configuration so far.
            avg_performance = np.mean(accuracies)
            if best_accuracy is None or avg_performance > best_accuracy:
                best_accuracy = avg_performance
                best_param = param
        
        # Retrain on the full outer training set with the best configuration found.
        outer_model = RandomForestClassifier(random_state=seed, **best_param).fit(X_train_outer, outer_y_train)
        outer_acc = outer_model.score(X_test_outer, outer_y_test)
        outer_accuracy_combinations.append((outer_acc, best_param))

    return outer_accuracy_combinations

In [None]:
### NOT SURE WHY THIS DOESN'T PASS........
def test_optimize_random_forest():
    output = optimize_random_forest(df_reviews["processed_review"], df_reviews["sentiment"], [{'n_estimators':100, 'max_depth':None},{'n_estimators':500, 'max_depth':None}, {'n_estimators':1000, 'max_depth':10},{'n_estimators':1000, 'max_depth':100}, {'n_estimators':1000, 'max_depth':None}], "TF-IDF", 10, 2)
    assert output == [(0.87, {'n_estimators': 1000, 'max_depth': 100}), (0.842, {'n_estimators': 1000, 'max_depth': 100}), (0.86, {'n_estimators': 1000, 'max_depth': None}), (0.824, {'n_estimators': 1000, 'max_depth': 100}), (0.826, {'n_estimators': 1000, 'max_depth': 100}), (0.848, {'n_estimators': 1000, 'max_depth': 10}), (0.856, {'n_estimators': 1000, 'max_depth': None}), (0.858, {'n_estimators': 1000, 'max_depth': 100}), (0.838, {'n_estimators': 1000, 'max_depth': 100}), (0.834, {'n_estimators': 1000, 'max_depth': 100})], output
    print("All tests passed!")
    
test_optimize_random_forest()

**Note:** Again, the local tests take significant time, so please plan ahead. On an Azure `Standard DS2 V3` compute, these tests take around 20 minutes to run. If your solution is taking significantly more time, check out the `n_jobs` argument of RandomForestClassifier.


From these results, we can see that our nested cross-validation has revealed a few things of note:

1. We have too few inner splits. When working with this type of setup, we need more data in the inner split to decide whether setting `max_depth` to `100` or `None` is reasonable for this data, as currently we are incapable of deciding between them.
1. It seems reasonable to look at higher `n_estimators` values. While we looked at lower values like `100` and `500`, they were never chosen, so it might be worth considering larger values to see whether we get diminishing returns. This is consistent with the conceptual materials: a RandomForest usually performs better as more estimators become available.
1. In general, nested cross validation with a fixed set of choices seems rather time intensive. After running this grid, if we wanted to extend this to new sets of choices, we would need to specify a larger grid that would take exponentially more time. There are methods that can speed this up, but in general hyper-parameter optimization is a fairly difficult task.
1. Lastly, sklearn's RandomForestClassifier seems ill-suited to this task compared to the SVM we looked at previously. If you look online, Random Forests are usually considered state of the art for small-dataset classification tasks, yet despite our extensive set of parameters, we were unable to find a setting that did as well as the SVM. It's important to realize that, no matter what the literature says is hot, models can only be truly evaluated by testing. While it's possible that there is a hyper-parameter setting where Random Forests outperform SVMs for this featurization, this result is a good reminder that it's very important to be exhaustive in hyper-parameter optimization.

Through all of this, we can reasonably say that, if accuracy is our goal, then we should pick the SVC model. However, when deploying any machine learning model, it's possible that accuracy is not our main goal. Let's take the best RandomForestClassifier and compare it with our SVM model in more detail. Let's first train both models on the same training-test split:

In [20]:
svc_clf = SVC(kernel = "rbf", C = 1.0)
rf_clf = RandomForestClassifier(n_estimators = 1000, max_depth = None, random_state = 0, n_jobs = -1)
# train test split
reviews_train, reviews_test, sentiments_train, sentiments_test = train_test_split(
    df_reviews["processed_review"], df_reviews["sentiment"], random_state = 42
)
# TFIDF vectorizer
tfidf_vec_train, tfidf_vec_test, features = tfidf_vectorizer(reviews_train, reviews_test)
# fit SVM
svc_clf.fit(tfidf_vec_train, np.where(sentiments_train == "positive", 1, -1))
# fit random forest
rf_clf.fit(tfidf_vec_train, np.where(sentiments_train == "positive", 1, -1))

RandomForestClassifier(n_estimators=1000, n_jobs=-1, random_state=0)

Then, let's use sklearn's `classification_report` to report the precision, recall, f1-score for our models:

In [21]:
print(sklearn.metrics.classification_report(
    np.where(sentiments_test == "positive", 1, -1),
    svc_clf.predict(tfidf_vec_test)
))

              precision    recall  f1-score   support

          -1       0.87      0.85      0.86       617
           1       0.86      0.88      0.87       633

    accuracy                           0.87      1250
   macro avg       0.87      0.87      0.87      1250
weighted avg       0.87      0.87      0.87      1250



In [22]:
print(sklearn.metrics.classification_report(
    np.where(sentiments_test == "positive", 1, -1),
    rf_clf.predict(tfidf_vec_test)
))

              precision    recall  f1-score   support

          -1       0.84      0.86      0.85       617
           1       0.86      0.84      0.85       633

    accuracy                           0.85      1250
   macro avg       0.85      0.85      0.85      1250
weighted avg       0.85      0.85      0.85      1250



### Question 6: Model Selection Under Different Conditions
`classification_report` provides us with several metrics and, surprisingly, our SVC model does not dominate the RandomForest model on every metric. Implement the function `use_model_under`, which returns either "random forest" or "svm" as the better choice for each situation given these model statistics. Refer to the conceptual content for more details about the metrics if you are unsure what they mean.
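As a reminder of how the per-class numbers in `classification_report` are obtained, the sketch below computes precision and recall for the positive class by hand on a handful of made-up labels (hypothetical values, unrelated to the reports above) and checks them against sklearn.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: 4 positive and 4 negative reviews.
y_true = np.array([1, 1, 1, 1, -1, -1, -1, -1])
y_pred = np.array([1, 1, 1, -1, 1, 1, -1, -1])

tp = int(np.sum((y_pred == 1) & (y_true == 1)))    # predicted positive, truly positive
fp = int(np.sum((y_pred == 1) & (y_true == -1)))   # predicted positive, truly negative
fn = int(np.sum((y_pred == -1) & (y_true == 1)))   # predicted negative, truly positive

# Precision: of the reviews we labeled positive, how many really are positive?
precision_pos = tp / (tp + fp)
# Recall: of the truly positive reviews, how many did we find?
recall_pos = tp / (tp + fn)

print(precision_pos, precision_score(y_true, y_pred, pos_label=1))
print(recall_pos, recall_score(y_true, y_pred, pos_label=1))
```

Not wanting to miss reviews of a class is a statement about that class's recall, while wanting every review picked for a class to truly belong to it is a statement about that class's precision.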

In [23]:
def use_model_under():
    """
    For each situation, pick either "random forest" or "svm" as better model.
    """
    # Situation 1: You want to make sure that you don't miss any positive reviews.
    dont_miss_positive_reviews = "svm"
    # Situation 2: Assuming class distribution is unchanged, you want to reasonably decide the sentiment of a review.
    dont_misclassify_negative_reviews = "svm"
    # Situation 3: You want to make sure that every review you pick as negative is negative.
    dont_misclassify_positive_reviews = "svm"
    # Situation 4: You want to make sure that you don't miss any negative reviews.
    dont_miss_negative_reviews = "random forest"
    return dont_miss_positive_reviews, dont_misclassify_negative_reviews, dont_misclassify_positive_reviews, dont_miss_negative_reviews
    

Note that this local test only checks the format of your answer so that it can be processed by the autograder. Passing the local test does **not** mean that your answers are correct in the autograder.

In [24]:
def test_use_model_under():
    answers = use_model_under()
    # A bare non-empty list is always truthy, so wrap the checks in all()
    assert all(answer in ("random forest", "svm") for answer in answers)

test_use_model_under()

Overall, we can see that our model development process would have gone differently if we had focused on a different metric. This goes to the heart of model development: picking a metric is as important as developing a model. When deploying a model in production, we need to pay close attention to which metric we are using, as it will heavily influence the model we end up picking.

Now let's move on to Part B, where we will change domains and work on another natural language processing pipeline, this time for question answering.