# Scikit-learn

Scikit-learn is a machine-learning library for Python.  Although it doesn't include some of the latest hip techniques (such as deep learning), it does include a wide array of different tools for all manner of machine-learning tasks.  Also, it combines these tools into a coherent framework, making it easy to "mix and match" different techniques.  In this notebook we'll take a look at some of the stuff you can do with `scikit-learn`.

## Basics

It all begins with importing scikit, but the actual package name is `sklearn`:

In [None]:
import sklearn

All of the useful functionality of scikit is contained within subpackages inside `sklearn`.  Unlike some other packages we've dealt with (like NLTK), with scikit you usually must explicitly import each subpackage you want to use.  For instance, scikit has a `linear_model` subpackage, but we can't access it yet because we didn't import it:

In [None]:
sklearn.linear_model

We can import it and then it will be there:

In [None]:
import sklearn.linear_model
sklearn.linear_model

Sometimes the packages we want may be deeply nested (like `sklearn.some_thing.other_thing.yet_another_thing`).  In such cases it can be convenient to use `from ... import ...` to get a shortcut to an inner package:

In [None]:
from sklearn import linear_model
linear_model

These are just technical details, but keep them in mind when working with scikit; you just have to be sure you've imported whatever tools you're trying to use.

### Scikit concepts

All scikit models take data in the same format: there are always X values and sometimes Y values.  The X values are the data you're going to use to make your prediction (sometimes called the "independent" or "exogenous" variable), and Y is what you're trying to predict (sometimes called the "dependent" or "endogenous" variable).  If you're not trying to predict anything, but just trying to compare the data points to each other, there won't be any Y value.

The X values are always a table in the format they call "(n_samples, n_features)".  This is read as a two-dimensional description meaning that the rows of the table are samples and the columns are features.  "Samples" are the individual data points; one sample could be, for instance, one Yelp review or one Wikipedia article.  "Features" are the things you want to use to make your prediction; examples would be words from the review text, numerical data such as the number of "cool" or "useful" votes on a review, etc.

The Y values typically are a single sequence of values, representing the thing you're trying to predict.  For instance, the Y value might be a list of the star ratings of all the reviews.  In some cases, you can predict more than one Y value at once; for instance, if you're trying to predict what categories a business falls into, you might try to simultaneously predict a yes/no result on each of several categories (restaurant, gas station, etc.).  If you're predicting multiple Y values, they'll be in a similar two-dimensional format, which scikit calls "(n_samples, n_outputs)": one row per sample and one column per thing you're predicting.
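To make these shapes concrete, here is a minimal sketch with a made-up table of three reviews and two features.  The column names and values are invented for illustration; they are not taken from the Yelp data.

```python
import pandas

# Hypothetical toy data: 3 samples (rows) and 2 features (columns).
X = pandas.DataFrame({
    "n_great": [2, 0, 1],
    "n_terrible": [0, 3, 0],
})

# One target value per sample, e.g. a star rating.
y = pandas.Series([5, 1, 4], name="stars")

print(X.shape)  # (n_samples, n_features)
print(y.shape)  # (n_samples,)
```

The `.shape` attribute is a quick way to check that X and Y have the same number of rows before handing them to a model.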

A separate distinction is that between "train" and "test" data.  When we're trying to predict something, we need a way to check if our model is actually able to predict new data.  To do this, we split our data into a "train set" and a "test set".  The training set is used to "teach" the model how to predict the output from the input.  Once we have trained the model, we run it on the test set, see what it predicts, and compare the predictions to the real results for the test set.  If the model's predictions are close, great; if not, back to the drawing board!  In general, both the train and test sets will have X and Y parts as described above.

For the most part, you can use Pandas data structures to represent your X and Y data.  So for your X data, you can use a DataFrame where each row is a sample and the columns represent the data you're using to make the prediction.  For your Y data, you'll typically use a Series of the values you're trying to predict.

However, one thing to keep in mind is that **scikit will usually not give you Pandas objects back**.  It will only give you plain `numpy` arrays.  `numpy` arrays are similar to Pandas structures in that they are tabular, but they are more limited in that they don't have names for the rows and columns (they can only be accessed by numerical index, similar to `.iloc`), and in that a single numpy array can only contain data of a single type (so, say, numbers or strings but not both).

So scikit will happily take in your Pandas values, but after it does its work it will give you numpy arrays back instead.  This means that you will typically have to do some "bookkeeping" in order to match up the results with the original data.  Usually this is not too hard though.

Another thing to keep in mind is that **the main prediction parts of scikit work only with numerical data**.  As you might imagine, this is important to know in computational linguistics, since we deal a lot with text.  But we can't actually feed text into our models.  We always have to somehow convert the text to numbers.  We can do this by, for instance, converting a Yelp review to a set of numbers representing "how many times the word *great* occurred in the review", "how many times the word *terrible* occurred in the review", and so on.  Fortunately, scikit does provide some preprocessing tools that will help do this for us, taking in our text and giving us numbers back out.
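One such tool is `CountVectorizer` from the `sklearn.feature_extraction.text` subpackage.  The sketch below is only meant to illustrate the idea of turning text into counts: the two sentences are made up, and the vocabulary is restricted to two words to keep the output small.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up example texts (not from the Yelp data).
texts = [
    "Great food, great service!",
    "Terrible. Just terrible, not great.",
]

# Restrict the vocabulary to the two words we care about.
# By default CountVectorizer lowercases the text before counting.
vectorizer = CountVectorizer(vocabulary=["great", "terrible"])
counts = vectorizer.fit_transform(texts).toarray()

# Each row is a text; the columns are the counts of "great" and "terrible".
print(counts)
```

The result is a plain table of numbers in the "(n_samples, n_features)" layout, which is exactly what the models need.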

### Preparing your data for scikit

So, to work with scikit, our first task is to take our data and massage it into a form that scikit can deal with.

For "real" projects using textual data, this preprocessing is almost always going to be done in an automated way that gives us a large number of features.  However, to get a feel for the process, we are going to do it in "slow motion" and create some example features one by one.

We'll be using the file `reviews_champaign.csv`, which contains about 26,000 reviews of all sorts of businesses in the city of Champaign, Illinois.  First we'll load the file into pandas:

In [None]:
import pandas
df = pandas.read_csv('../Data/reviews_champaign.csv')
df.head()

The columns called "ReviewID", "UserID", and "BusinessID" are just random strings that uniquely identify each review, business, and reviewer.  They might be useful to you if you use this data for a project, and want to embellish it with other data that's available about the businesses and/or users, but for now they're just taking up space, so we'll reduce our DataFrame to just the columns we want to focus on:

In [None]:
df = df[["BusinessName", "Categories", "Date", "Stars", "Text", "Useful", "Funny", "Cool"]]
df.head()

The text of each review is in the `Text` column (which is cut off in the display because the reviews are long).  Each business also has a list of categories in the `Categories` column, separated by semicolons.

In the context of computational linguistics, we'll most often want to work with text data like the reviews.  But, as mentioned above, we need to somehow convert that to numerical data so we can work with it in scikit.  This process is called **feature extraction**.

As a simple example, we'll propose a very simple hypothesis and then try to test it using our data.  We hypothesize that occurrences of the words *food* and *eat* can help us decide whether a review is a review of a restaurant (as opposed to some other kind of business).  To test this, we'll extract two very simple features from the reviews, namely "how many times do the words *food* and *eat* appear in this review?"  Then we'll try to use that to predict whether the business is a restaurant or not, using the category information we have.

How can we count how many times these words appear in each review?  To warm up, let's start by looking not at the review text but at the categories.

As you can see, each value in `Categories` is a sequence of category names separated by semicolons.  (Note that in general the category names are plural if they are count nouns, e.g., "Restaurants" rather than "Restaurant".)  We could just do something like this:

In [None]:
df.Categories.str.contains("Restaurants").head()

As we saw briefly in the Pandas intro earlier in the class, the `.str` attribute of a Series gives us access to useful string methods if the column values are strings.  In this case, `str.contains` checks whether the passed string ("Restaurants") is a substring of each column value.

This will actually work fairly well for our purposes.  Just for completeness, we'll show a slightly different way as well.  We could split out the categories into a list of separate values:

In [None]:
df['CategoryList'] = df.Categories.str.split(';')
df.head()

Now that we have that, we could use `map` to check for "Restaurants" in each list:

In [None]:
df['IsRestaurant'] = df.CategoryList.map(lambda cats: 'Restaurants' in cats)
df.head()

This is slightly safer, since we know we're checking against the entire category name.  For "Restaurants" this is not a huge problem (although there is a category of "Pop-up Restaurants"), but for other categories we might have to be careful about accidentally including a category whose name is a substring of another.  For instance, if we just used `.str.contains` to look for "Golf", we would also include the category "Disc Golf", which we probably wouldn't want to do.  If we searched for "Adult" we would get both "Adult Entertainment" and "Adult Education", which we probably wouldn't want to combine!
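Here is a small sketch of the "Golf" pitfall, using made-up category strings in the same semicolon-separated style (these rows are invented, not taken from the dataset):

```python
import pandas

# Made-up category strings.
cats = pandas.Series([
    "Golf;Active Life",
    "Disc Golf;Parks",
    "Restaurants;Pizza",
])

# Substring match: also catches "Disc Golf".
substring_match = cats.str.contains("Golf")

# Whole-name match: split first, then test list membership.
exact_match = cats.str.split(";").map(lambda names: "Golf" in names)

print(substring_match.tolist())
print(exact_match.tolist())
```

Only the split-then-check version leaves out the "Disc Golf" row.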

## (Crappy) Feature Extraction

The reason I mention these issues is that they're much more important when it comes to handling the actual review text.  Remember that we want to count how many times the words *food* and *eat* occur in each review.  One simple way to look for *eat* would be to do this:

In [None]:
df.Text.str.count("eat").head(10)

The problem with this is that it counts occurrences of the *string* "eat", not occurrences of the *word* "eat".  For instance, it will also count occurrences of the words "meat", "feat", "repeat", "beaten" and so on.  When we're just looking at the categories, this is sometimes a problem but not usually a giant problem, because the number of category names is relatively small and there are few overlaps.  But with the review text it is critically important not to do this, because of course the reviews may contain all sorts of words, and there are many words which occur as substrings of other words even though they are unrelated in meaning.
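A lighter-weight alternative, shown here only for comparison, is a regular expression with word boundaries (`\b`).  This is not the approach we'll use in the rest of this notebook, and it handles punctuation and contractions more crudely than a real tokenizer would, but it makes the string-versus-word difference easy to see on a made-up sentence:

```python
import pandas

text = pandas.Series(["I eat meat and repeat: eat!"])  # made-up sentence

# Plain substring count: also matches inside "meat" and "repeat".
substring_count = text.str.count("eat")

# \b marks a word boundary, so only the standalone word "eat" matches.
word_count = text.str.count(r"\beat\b")

print(substring_count.iloc[0], word_count.iloc[0])
```

The substring count picks up four matches, while only two of them are the actual word *eat*.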

To get an accurate count we need to do a real tokenization.  This is one situation where NLTK can have an advantage over Spacy.  Because Spacy does so much more --- not only tokenizing but part-of-speech tagging, dependency parsing, etc. --- it typically takes substantially longer than NLTK to process a given text.  This was not so noticeable in our earlier use of Spacy because we were only using it to process one text at a time.  But we'll see the difference if we try to process many texts.  For a quick comparison, we'll set up both NLTK and Spacy.

In [None]:
import nltk
tokenizer = nltk.tokenize.TweetTokenizer()

import spacy
nlp = spacy.load('en')

We'll use Jupyter's timing abilities to time how long it takes to tokenize the first 1000 reviews:

In [None]:
%%time
df.Text.iloc[:1000].map(tokenizer.tokenize)

Took about half a second for me.  Now let's try it with Spacy:

In [None]:
%%time
df.Text.iloc[:1000].map(nlp)

That took about a minute and a half for me --- more than 150 times slower!  Remember that this is only the first 1000 reviews.  There are 26,000 reviews in this dataset, so we'd estimate that processing them all with Spacy would take about 40 minutes.

It is possible to speed Spacy up by disabling some of its abilities.  Here we'll disable the parser, tagger, and named entity recognizer.

In [None]:
fast_nlp = spacy.load('en', disable=['parser', 'tagger', 'ner'])

Now we'll see how fast it is:

In [None]:
%%time
df.Text.iloc[:1000].map(fast_nlp)

About 5 seconds for me.

That's still 10 times slower than NLTK, but much better.  At this rate it would take about 2 minutes to process the whole Champaign review dataset, which is pretty manageable.  For your project, you may want to use Spacy (either the fast or the slow version) if you want to leverage its abilities.  For instance, you might get better results on a sentiment analysis task if you lemmatize the reviews.  For this tour, however (and for your assignment), we're going to use NLTK just so we can move along quicker.  (Also, as we'll see later, in practice we may use a yet different tokenization process built into scikit.)

Let's go ahead and tokenize the whole thing with NLTK.  We'll also convert everything to lowercase, which is usually a good idea since it lets us count "eat" and "Eat" the same (and it helps us practice with `lambda` :-):

In [None]:
df['Toks'] = df.Text.map(lambda txt: tokenizer.tokenize(txt.lower()))
df.head()

Now we can easily get out the counts we want:

In [None]:
df['NEat'] = df.Toks.map(lambda toks: toks.count('eat'))
df['NFood'] = df.Toks.map(lambda toks: toks.count('food'))
df.head()

One thing that can be useful just to get an overview of the data is to group by our desired output category and look at the mean values of our putative predictors:

In [None]:
df.groupby('IsRestaurant')[['NEat', 'NFood']].mean()

This provides support for our hypothesis on a very basic level: the reviews where `IsRestaurant` is True have, on average, 0.13 occurrences of *eat*, while those where `IsRestaurant` is False have only 0.02 occurrences.  The same pattern holds for *food*.  In other words, overall, restaurant reviews have more occurrences of these two words.

Let's just take a moment to remind ourselves that this is *not* a rigorous way to test such a hypothesis!  If we were really doing that, we'd want to use a chi-squared test or some similar method that could give us a confidence interval or p-value for determining whether the result is statistically significant.  But we're not doing that.  We're just using this as an example of how to use scikit for machine learning.  So let's do it!

## Using scikit models

The only thing we need to do before beginning to use scikit is to make clear in our own minds how our data will be used.  In this case it is as follows:

1. We are trying to use `NEat` and `NFood` to predict `IsRestaurant`
2. Therefore, our X value consists of the two columns `NEat` and `NFood` and our Y value consists of the column `IsRestaurant`.
3. Because `IsRestaurant` is distinguishing between categories (restaurant and non-restaurant), this is a classification task.  (And because there are only two categories, it is a binary classification task.)

### Train/test split

It's finally time to unleash the power of scikit.  The first step is to split our data into training and testing sets.  To do that we need the `model_selection` subpackage, and we'll use the `train_test_split` function from it:

In [None]:
import sklearn.model_selection
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(df, df.IsRestaurant)

What `train_test_split` does is randomly split the data into two pieces.  We give it the X and Y of the full dataset, and it gives us back a train and test set for each, in the order shown: X train, X test, Y train, Y test.  Suppose we take a peek at the results:

In [None]:
x_train

In this case, `train_test_split` did give us back a Pandas object.  This relatively simple function is dumb enough that it returns whatever data type we give it.  This is why we passed `df` to `train_test_split`, even though our actual X values are only two columns of `df`.  Having the train/test sets contain our whole table will help us in matching up the predicted values with the table if we want to later on.
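A small sketch of this behaviour, using a made-up eight-row table standing in for `df`: the pieces come back as a DataFrame and a Series, and they keep the original row labels, which is what makes the later matching-up possible.

```python
import pandas
from sklearn.model_selection import train_test_split

# Made-up table standing in for `df`.
toy = pandas.DataFrame({
    "NEat": [0, 1, 0, 2, 0, 1, 3, 0],
    "IsRestaurant": [False, True, False, True, False, True, True, False],
})

toy_x_train, toy_x_test, toy_y_train, toy_y_test = train_test_split(toy, toy.IsRestaurant)

# The types we passed in are the types we get back.
print(type(toy_x_train).__name__, type(toy_y_train).__name__)

# The X and Y pieces share the same row labels.
print(toy_x_test.index.equals(toy_y_test.index))

# By default a quarter of the rows go to the test set.
print(len(toy_x_train), len(toy_x_test))
```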

### Using a classification model

Now we can move on to the meat of the task.  For this example, we'll use the LogisticRegression model, which is a classification algorithm.  We import it first:

In [None]:
from sklearn.linear_model import LogisticRegression

Creating the model object is as easy as pie:

In [None]:
lr_model = LogisticRegression()

Now we train the model by calling `.fit()`.  We pass it the X and Y values **from our training set**.  Notice that at this point we have to grab just the columns we want for X:

In [None]:
lr_model.fit(x_train[['NEat', "NFood"]], y_train)

Hey, nothing happened!

Actually a lot happened.  In that brief, magical moment, the model learned whatever it could about the associations between our input and output variables.

Now that we have called `.fit()` on it, the model is a **fitted** model object, which has learned something from the data, and is ready to share that knowledge with us.

### Getting predictions

To find out what it hath wrought, we'll now tell the model to give us its predictions **on the test set**:

In [None]:
preds = lr_model.predict(x_test[["NEat", "NFood"]])

We can peek at what we've got:

In [None]:
preds

Notice that the result is no longer a Pandas object but an unvarnished numpy array.  We can see, though, that it's got its predictions in there, True or False for each review in the test set.

A good way to deal with these prediction values is to put them into a DataFrame along with the true Y values.  This lets us match them up side by side.  We'll call this table `output`:

In [None]:
output = pandas.DataFrame({"Actual": y_test, "LR": preds})
output.head(10)

### Evaluating the model

Now it's time to evaluate the model.  One simple way to eyeball it is to use the Pandas `.crosstab` function to show us the confusion matrix:

In [None]:
pandas.crosstab(output.Actual, output.LR)

Looks like our model was not so hot.  It has a low false positive rate (few non-restaurants were categorized as restaurants) but a high false negative rate (many restaurants were incorrectly categorized as non-restaurants).  We can get a clearer picture of the performance by using some of the classification metrics.  Each metric (for classification) takes as input the true values and the predicted values.  For most classification metrics, the best score is 1 and zero is a bad score; some metrics also may give negative scores for models that are so bad that reversing their predictions would be better.
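To see how such metrics relate to the confusion matrix, here is a worked sketch on ten made-up labels (not our real results).  It computes accuracy, precision, recall and F1 by hand from the four cells of the matrix and prints scikit's own numbers next to them for comparison.

```python
import sklearn.metrics

# Made-up true labels and predictions (True = restaurant).
actual = [True, True, True, True, True, True, False, False, False, False]
predicted = [True, True, False, False, False, False, False, False, False, True]

pairs = list(zip(actual, predicted))
tp = sum(a and p for a, p in pairs)              # restaurants called restaurants
fn = sum(a and not p for a, p in pairs)          # restaurants called non-restaurants
fp = sum((not a) and p for a, p in pairs)        # non-restaurants called restaurants
tn = sum((not a) and (not p) for a, p in pairs)  # non-restaurants called non-restaurants

accuracy = (tp + tn) / len(actual)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("By hand:", accuracy, f1)
print("Scikit: ", sklearn.metrics.accuracy_score(actual, predicted),
      sklearn.metrics.f1_score(actual, predicted))
```

Notice that in this toy example the many false negatives drag the recall (and therefore the F1 score) down even though the precision is decent.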

In [None]:
import sklearn.metrics
print("Classification accuracy:", sklearn.metrics.accuracy_score(output.Actual, output.LR))
print("F1 score:", sklearn.metrics.f1_score(output.Actual, output.LR))
print("AUC score:", sklearn.metrics.roc_auc_score(output.Actual, output.LR))
print("Matthews coefficient:", sklearn.metrics.matthews_corrcoef(output.Actual, output.LR))

What do these numbers mean?  Not much, by themselves.  For instance, the classification accuracy is 0.68, meaning the model chose the correct result (restaurant or non-restaurant) 68% of the time.  A bit more than two-thirds.  Awesome!

. . . OR IS IT???

Two-thirds accuracy might sound pretty good, but to really tell, we need something to compare it against.  Let's make a really simple model.  To decide what simple model to make, let's look back at our original data and see how many restaurants and non-restaurants there are overall:

In [None]:
df.IsRestaurant.value_counts()

Roughly two-thirds of the reviews are of restaurants.  So one simple model would just be, if you're trying to guess whether a review is of a restaurant or not, just always guess yes!  You can expect to be right two-thirds of the time.  We can already see that this means our 68% performance is not that impressive; we could have done just as well with this really simple "model" that just guesses the same thing every time.

Let's add this "model" to our `output` table.  (Setting a column of a DataFrame to a single value just repeats the value down the whole column.)

In [None]:
output['Dumb'] = True
output.head()

Now let's compute the scores of our dumb model:

In [None]:
print("Classification accuracy:", sklearn.metrics.accuracy_score(output.Actual, output.Dumb))
print("F1 score:", sklearn.metrics.f1_score(output.Actual, output.Dumb))
print("AUC score:", sklearn.metrics.roc_auc_score(output.Actual, output.Dumb))
print("Matthews coefficient:", sklearn.metrics.matthews_corrcoef(output.Actual, output.Dumb))

The classification accuracy is essentially the same, and on at least one metric (F1 score), the dumb model is better than our scikit model!  This should give us pause.
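As an aside, scikit ships a ready-made version of this kind of baseline: `DummyClassifier` in the `sklearn.dummy` subpackage.  The sketch below uses made-up data; with `strategy="most_frequent"` the model ignores X entirely and always predicts the most common class it saw during training.

```python
import pandas
from sklearn.dummy import DummyClassifier

# Made-up data: four of the six rows are True.
toy_x = pandas.DataFrame({"NEat": [0, 1, 0, 2, 1, 0]})
toy_y = pandas.Series([True, True, False, True, True, False])

dumb_model = DummyClassifier(strategy="most_frequent")
dumb_model.fit(toy_x, toy_y)

# Every prediction is the majority class, whatever the X values are.
dumb_preds = dumb_model.predict(toy_x)
print(dumb_preds)
```

Because it has the same `.fit()` and `.predict()` methods as any other scikit model, a baseline like this can be dropped into the same code as the real models.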

### Peeking inside the model

Before we move on to consider how we might improve our model, let's take a look at the fitted model object itself.  In addition to providing predictions, many kinds of fitted models can give us other information.  As a simple example, most classifiers can tell us not only their predictions, but the probability of each category, with the `.predict_proba` method:

In [None]:
lr_model.predict_proba(x_test[["NEat", "NFood"]])

What we have here is a two-dimensional array, where the first column tells us the probability of the first category, and the second tells us the probability of the second category.  Since our two categories were `True` and `False`, and, by convention in the world of computers, `True` is 1 and `False` is 0, that means that the first column here is telling us the probability that a given review is *not* of a restaurant, and the second column is telling us the probability that it *is* of a restaurant.  Notice that the probabilities in each row add up to 1, since that represents the total probability of all categories.  (If we were doing a multiclass classification, there would be more columns, giving us the probability of each possible category, in sorted order.)
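If you're ever unsure which column is which, the fitted model's `classes_` attribute lists the categories in the same order as the columns of `predict_proba`.  A minimal sketch on made-up data:

```python
import pandas
from sklearn.linear_model import LogisticRegression

# Made-up counts and labels.
toy_x = pandas.DataFrame({"NFood": [0, 0, 1, 2, 3, 0, 1, 4]})
toy_y = pandas.Series([False, False, True, True, True, False, False, True])

toy_model = LogisticRegression().fit(toy_x, toy_y)
probs = toy_model.predict_proba(toy_x)

print(toy_model.classes_)   # column order of predict_proba
print(probs.shape)          # one row per sample, one column per category
print(probs.sum(axis=1))    # each row sums to 1
```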

We can see that the numbers here agree with the predictions above.  The first three rows all have the second number greater than the first, so the first three predictions are all True.  We can also see, though, that for the first two rows the model was quite confident about this prediction (91% chance of True), whereas for the third row it was pretty much a toss-up.  And, looking back at the predictions, we can see that for those first two rows the model was correct, whereas for the third row it was wrong.  This suggests that perhaps, even though the model is wrong, it may at least know when it's likely to be wrong.

This can be useful information in many situations, especially when the model is being used to filter information for humans.  For instance, machine-learning models can be used to flag or delete objectionable content (obscenity, spam, pornographic images, etc.).  But you could use the model's probability prediction to bring in a human observer for tough cases.
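As a rough sketch of that idea, the snippet below takes a hypothetical `predict_proba`-style array and flags the rows where the model's confidence in its own prediction falls below a cutoff.  Both the probabilities and the 0.75 cutoff are made up for illustration.

```python
import numpy as np

# Hypothetical predict_proba output: columns are P(False), P(True).
probs = np.array([
    [0.09, 0.91],
    [0.48, 0.52],
    [0.80, 0.20],
])

confidence = probs.max(axis=1)    # probability of whichever class was predicted
needs_human = confidence < 0.75   # arbitrary cutoff for "tough cases"

print(needs_human)
```

Only the middle row, the near toss-up, would be passed along to a human.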

In the case of this LogisticRegression model, we can also look at its coefficients in the `.coef_` attribute:

In [None]:
lr_model.coef_

Remember that we had two predictors here (`NEat` and `NFood`).  Roughly speaking, the coefficients are telling us how useful each of those is for making the classification.  Higher values here mean that higher values of that predictor variable push the prediction more towards the "higher" category, which as we noted above is True.  The values are in the order that we provided the predictors (in other words `NEat` is first).  Both numbers are positive, indicating that higher values of `NEat` and `NFood` nudge the prediction toward True; this is as we would expect, since it means more occurrences of *eat* and *food* increase our confidence that the review describes a restaurant.  The value for `NFood` is higher, meaning that the word *food* is more predictive of restaurant reviews than the word *eat*.
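For the curious, the link between the coefficients and the predicted probabilities can be checked directly.  In a binary logistic regression, the probability of the "higher" category is the logistic (sigmoid) function applied to the intercept plus the coefficient-weighted sum of the predictors.  The sketch below fits a model to made-up data and reproduces the second column of `predict_proba` by hand:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up counts for two predictors and a True/False target.
toy_x = np.array([[0, 0], [1, 0], [0, 2], [2, 3], [0, 1], [3, 1], [0, 0], [1, 4]])
toy_y = np.array([False, False, True, True, False, True, False, True])

toy_model = LogisticRegression().fit(toy_x, toy_y)

# Linear score: intercept + coefficient-weighted sum of the predictors.
scores = toy_model.intercept_[0] + toy_x @ toy_model.coef_[0]

# The sigmoid squashes the score into a probability between 0 and 1.
manual_p_true = 1 / (1 + np.exp(-scores))

model_p_true = toy_model.predict_proba(toy_x)[:, 1]
print(np.allclose(manual_p_true, model_p_true))
```

This is why a positive coefficient nudges the prediction toward True: it raises the score, and a higher score means a higher probability.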

Different types of models give different kinds of information.  All classification models can make a prediction with `.predict()`, but once we go beyond that the information available in these extra model attributes (like `.coef_`) is often more closely tied to the mathematical underpinnings of the model, meaning that it may not be possible to compare different types of models by using such attributes.  However, these attributes can be useful to get a conceptual understanding of "how" the model is making its decisions.

### Trying again

We saw above that our LogisticRegression model was not really any better than a simplistic model that just predicts "this is a restaurant" every time.  There are three ways to proceed from here:

1. Try tweaking parameters of our original model to improve it.
2. Try a different type of model from scikit.
3. Try adding more data.

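We'll try idea #2 first.  This time we'll use an Extra-Trees classifier (short for "extremely randomized trees"), a close relative of the Random Forest algorithm: both build a whole ensemble of decision trees and combine their predictions.  For brevity we'll label it `RF`: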

In [None]:
from sklearn.ensemble import ExtraTreesClassifier
rf_model = ExtraTreesClassifier()

Since we've already done the train/test split, using this model is exactly the same as using the other one:

In [None]:
rf_model.fit(x_train[['NEat', "NFood"]], y_train)

And getting the predictions is the same too.  We'll add them to our `output` table:

In [None]:
output['RF'] = rf_model.predict(x_test[["NEat", "NFood"]])
output.head()

We can again compute the evaluation metrics:

In [None]:
print("Classification accuracy:", sklearn.metrics.accuracy_score(output.Actual, output.RF))
print("F1 score:", sklearn.metrics.f1_score(output.Actual, output.RF))
print("AUC score:", sklearn.metrics.roc_auc_score(output.Actual, output.RF))
print("Matthews coefficient:", sklearn.metrics.matthews_corrcoef(output.Actual, output.RF))

Suspiciously, this model has the exact same scores as our LogisticRegression model above.  Looking at the outputs table, we can see that the two models appear to be making the exact same predictions.

When you see this kind of thing, it is usually a sign that your models are reaching the limits of what can be done with the data.  If two very different models make the exact same predictions, it may indicate that the data simply do not have much predictive power.

### Trying yet again

To address this, let's beef up our data a bit.  We'll add columns for a few other words that we think might help us distinguish restaurants from non-restaurants:

In [None]:
df['NDelicious'] = df.Toks.map(lambda toks: toks.count('delicious'))
df['NTasty'] = df.Toks.map(lambda toks: toks.count('tasty'))
df['NHungry'] = df.Toks.map(lambda toks: toks.count('hungry'))
df.head()

To re-run our models again, we'll need to make a new train/test split, since our data changed.

When we do a new train/test split, we might get different performance even from our same old models, because maybe we'll get luckier in terms of how easy it is to predict the data in the test set.  Thus, if you change the train/test split, you should always re-run all models.  Here, we'll do this explicitly by rewriting the code, so we can see the unfolding narrative of our gradually increasing knowledge of scikit, but in your project (or assignment) it may be appropriate to just edit your code and re-run it rather than copying and pasting it again.
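If you'd like the split to come out the same every time you re-run the code, `train_test_split` accepts a `random_state` argument.  The sketch below uses a made-up table, and the seed value 0 is arbitrary:

```python
import pandas
from sklearn.model_selection import train_test_split

# Made-up table standing in for `df`.
toy = pandas.DataFrame({"NEat": range(20), "IsRestaurant": [True, False] * 10})

# Same seed, same split.
split_a = train_test_split(toy, toy.IsRestaurant, random_state=0)
split_b = train_test_split(toy, toy.IsRestaurant, random_state=0)

# Element 1 of each result is the X test set; compare which rows landed there.
print(split_a[1].index.equals(split_b[1].index))
```

Fixing the seed doesn't make the test set any easier or harder on average; it just means you won't get a different "luck of the draw" each time.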

Here is our earlier code in compact form:

In [None]:
# train/test split
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(df, df.IsRestaurant)

# create the models
lr_model = LogisticRegression()
rf_model = ExtraTreesClassifier()

# fit the models
# note that we now include extra columns
lr_model.fit(x_train[['NEat', "NFood", "NDelicious", "NTasty", "NHungry"]], y_train)
rf_model.fit(x_train[['NEat', "NFood", "NDelicious", "NTasty", "NHungry"]], y_train)

# create output table
output = pandas.DataFrame({"Actual": y_test})
output["Dumb"] = True
output["LR"] = lr_model.predict(x_test[["NEat", "NFood", "NDelicious", "NTasty", "NHungry"]])
output["RF"] = rf_model.predict(x_test[["NEat", "NFood", "NDelicious", "NTasty", "NHungry"]])

output.head(10)

It looks like we may still have the same problem.  Let's look at the scores.  Here I wrote a loop so I wouldn't have to write the code out three times:

In [None]:
for column in ["Dumb", "LR", "RF"]:
    print("Scores for model", column)
    print("Classification accuracy:", sklearn.metrics.accuracy_score(output.Actual, output[column]))
    print("F1 score:", sklearn.metrics.f1_score(output.Actual, output[column]))
    print("AUC score:", sklearn.metrics.roc_auc_score(output.Actual, output[column]))
    print("Matthews coefficient:", sklearn.metrics.matthews_corrcoef(output.Actual, output[column]))
    print()

There's some improvement here.  The classification accuracy of our two non-dumb models improved to a few percentage points more than the dumb model.  But still not great.

You can see that adding individual columns in this way would be a somewhat laborious task.  That's why in a bit we're going to look at a way to accomplish this in a more automated fashion.

### Using a regression model

First, though, let's look at an example of a regression task.  We'll try to predict the star rating given in a review based on some chosen words.  We'll add columns for those words to our data:

In [None]:
df['NGood'] = df.Toks.map(lambda toks: toks.count('good'))
df['NGreat'] = df.Toks.map(lambda toks: toks.count('great'))
df['NAmazing'] = df.Toks.map(lambda toks: toks.count('amazing'))
df['NBad'] = df.Toks.map(lambda toks: toks.count('bad'))
df['NAwful'] = df.Toks.map(lambda toks: toks.count('awful'))
df['NGross'] = df.Toks.map(lambda toks: toks.count('gross'))

df.head()

What would the equivalent of our "dumb" model above be?  Well, one simple thing would be to just predict the average star rating for every review.  Let's see what that average is:

In [None]:
df.Stars.mean()

So our dumb model will just predict every reviewer gave 3.566 stars.

The model creation process below should look very familiar, but now we use the `Stars` column as our Y.  Also, instead of using LogisticRegression and ExtraTreesClassifier, we'll use LinearRegression and ExtraTreesRegressor.  Notice how the entire process is essentially exactly the same as what we did above.

In [None]:
# train/test split
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(df, df.Stars)

# create the models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import ExtraTreesRegressor
lr_model = LinearRegression()
rf_model = ExtraTreesRegressor()

# fit the models
# I created a variable to hold the columns we want to use, rather than retyping them each time
x_columns = ["NGood", "NGreat", "NAmazing", "NBad", "NAwful", "NGross"]
lr_model.fit(x_train[x_columns], y_train)
rf_model.fit(x_train[x_columns], y_train)

# create output table
output = pandas.DataFrame({"Actual": y_test})
output["Dumb"] = 3.566
output["LR"] = lr_model.predict(x_test[x_columns])
output["RF"] = rf_model.predict(x_test[x_columns])

output.head(10)

Now let's evaluate our regression models.  For this we'll use the "mean absolute error" and "mean squared error" metrics.  I also include the "root mean squared error", which is just the square root of the mean squared error:

In [None]:
for column in ["Dumb", "LR", "RF"]:
    print("Scores for model", column)
    print("Mean absolute error:", sklearn.metrics.mean_absolute_error(output.Actual, output[column]))
    print("Mean squared error:", sklearn.metrics.mean_squared_error(output.Actual, output[column]))
    print("Root mean squared error:", sklearn.metrics.mean_squared_error(output.Actual, output[column])**0.5)
    print()

Note that these metrics represent the *error*, or residual, between the real value and our prediction.  Thus, *smaller* values are better, because they mean the difference between the real value and our prediction was small.
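To see what these metrics are actually computing, here is a worked sketch on four made-up ratings and predictions.  The values are computed by hand with numpy and then compared against scikit's functions.

```python
import numpy as np
import sklearn.metrics

actual = np.array([5, 1, 4, 3])              # made-up star ratings
predicted = np.array([4.0, 2.5, 4.0, 3.5])   # made-up predictions

errors = actual - predicted                  # 1.0, -1.5, 0.0, -0.5
mae = np.abs(errors).mean()                  # (1 + 1.5 + 0 + 0.5) / 4
mse = (errors ** 2).mean()                   # (1 + 2.25 + 0 + 0.25) / 4
rmse = mse ** 0.5

print("By hand:", mae, mse, rmse)
print("Scikit: ", sklearn.metrics.mean_absolute_error(actual, predicted),
      sklearn.metrics.mean_squared_error(actual, predicted))
```

Because the errors are squared before averaging, the mean squared error punishes the one large miss (1.5 stars) more heavily than the mean absolute error does.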

In this case our models appear to be a slight improvement on the dumb model.  But let's take a look at the values our model is predicting:

In [None]:
output.describe()

Notice that the LinearRegression model is sometimes predicting values less than 1 (or even less than zero!) and bigger than 5.  Such values are impossible in this context, because Yelp only allows users to give ratings from 1 to 5 stars (despite the many negative reviews ranting about how they wish they could give zero stars).  But the regression model doesn't know that, and will happily predict values outside that range.

This creates some built-in error, because a prediction of zero stars is guaranteed to be off by at least one whole star.  We could probably improve that LinearRegression model by chopping off the scores, so that anything less than 1 is counted as 1, and anything bigger than 5 is counted as 5.

To do that, we can use the Pandas `clip` method, which clips values to a specified range:

In [None]:
output.LR = output.LR.clip(1, 5)
output.describe()

Now our LinearRegression is constrained to reality.  Let's see how it affected the scores:

In [None]:
for column in ["Dumb", "LR", "RF"]:
    print("Scores for model", column)
    print("Mean absolute error:", sklearn.metrics.mean_absolute_error(output.Actual, output[column]))
    print("Mean squared error:", sklearn.metrics.mean_squared_error(output.Actual, output[column]))
    print("Root mean squared error:", sklearn.metrics.mean_squared_error(output.Actual, output[column])**0.5)
    print()

There was a marginal improvement in the LR model's score, but it's still slightly worse than the Random Forest.

Sometimes you can improve model results by doing a little cleanup like this after the model does its work.  For instance, we can also see that our models are predicting fractional numbers of stars, which is also impossible.  It might be that if we rounded our predictions, we'd improve our accuracy a bit.  (Or it might make things worse; it all depends on whether our predictions are close enough to the correct whole number of stars.)
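
Here's a small sketch of that kind of cleanup on some hypothetical predictions (again, made-up numbers, not output from our real models).  It clips first and then rounds, and compares the mean absolute error at each stage:

```python
import pandas

# Hypothetical raw predictions from some regression model, plus the true ratings
raw = pandas.Series([-0.3, 0.8, 2.6, 3.4, 5.7])
actual = pandas.Series([1, 1, 3, 4, 5])

clipped = raw.clip(1, 5)      # force everything into the valid 1-5 range
rounded = clipped.round()     # then snap to whole numbers of stars

def mae(pred):
    """Mean absolute error of a Series of predictions against `actual`."""
    return (pred - actual).abs().mean()

print("raw:    ", mae(raw))
print("clipped:", mae(clipped))
print("rounded:", mae(rounded))
```

Clipping can only help here, since every true rating is already inside the range we clip to.  Rounding is a gamble: in this toy data it fixes the 2.6 (which becomes a correct 3) but ruins the 3.4 (which becomes a wrong 3 when the truth was 4).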

However, if you do this, it's critical that you keep track of it in your code, because by introducing steps like that, which take place outside the model, you've added one more thing that you, the human, have to do yourself.  If you go to someone and say "Hey, I have a model that can get a root mean squared error of 0.57!" and they ask you to run it on some data, and you forget to do that cleanup step where you rounded or clipped the numbers, your model will suffer and you'll have egg on your face.  (Scikit provides something called "Pipelines" that can organize some of these steps in the process, but we're not going to cover that here.)

Although our "smart" models may seem like they're not much better than the dumb model, it's important to remember that they're using extremely limited data.  We only picked six words to consider!  Once we get to the point where we're considering a large number of words, we may see a significant improvement.

Also remember that, depending on the context, even a small improvement may be worthwhile.  Netflix gave people a million dollars for improving RMSE by 10%.  We just improved it by about 9%.  Of course, Netflix was trying to improve a model that was already quite good, whereas we just improved a model that was laughably crude.  Nonetheless, it remains true that what counts as a worthwhile improvement depends on your application.

## Non-crappy feature extraction

The above part of this tour was intended to give us an overview of the process of building, fitting, and predicting with scikit models.  However, as we discussed, the way we created our features there was somewhat cumbersome.

For one thing, it's impractical to manually create lots of individual features for individual words we think might be important.  Even just doing it for a half-dozen or so words got irritating.

For another thing, it's also not really leveraging the power of machine learning.  The whole idea of this enterprise is to try to let the machine figure out what's important.  If we sit there speculating about what words to include, we're making human judgements that may be tainted by human folly, when our goal was to let algorithms make such judgements in principled ways.  It's as if we bought a fancy, expensive Instant Pot but didn't know how to use it, so we just microwaved a Hot Pocket instead.

But no longer.  Now we'll look at how we can get scikit to automatically create a large number of features representing the occurrence of a large number of words in our texts.  It's actually fairly simple.

### Vectorizers

Scikit has a subpackage called `feature_extraction`, and in there is yet another subpackage called `text`.  (There's also one called `image` for extracting features from images, but that's not so germane for us computational linguists.)  In there are two main classes that we might use: `CountVectorizer` and `TfidfVectorizer`.  These essentially do what we did above, creating separate columns that count how many times each word appears, but they do it automatically, providing us with some simple parameters to tweak to affect the details of how they do it.

The difference between the two is that `CountVectorizer` just works with raw counts of words, while `TfidfVectorizer` weights the counts by their inverse document frequency, meaning that words that occur in many documents (that is, in our case, many Yelp reviews) are given less weight.  We'll start with CountVectorizer.
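
To see the difference concretely, here's a minimal sketch on three tiny made-up "reviews".  (It uses `.fit_transform()`, which is just a shortcut for calling `.fit()` and then `.transform()` on the same data.)

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Three made-up reviews; "good" appears in every one of them
docs = ["good food", "good service", "good pizza"]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)

tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(docs)

# Raw counts: in the first review, "good" and "food" each occur once
print(counts[0, count_vec.vocabulary_["good"]],
      counts[0, count_vec.vocabulary_["food"]])

# Tf-idf: "good" is in every review, so it gets less weight than "food"
good_weight = weights[0, tfidf_vec.vocabulary_["good"]]
food_weight = weights[0, tfidf_vec.vocabulary_["food"]]
print(good_weight, food_weight)
```

With raw counts the two words look equally important.  With tf-idf, the word that shows up everywhere is treated as less informative.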

Note that this is a different kind of "vector" than what we saw before with Spacy's word vectors.  In Spacy, each individual word has a vector of a fixed length, which (more or less) represents that word's meaning in a 300-dimensional space, relative to other words.  We could get a vector for a sentence or a word, but we did so only by adding up the vectors of the individual words, and getting a vector of the same size.  In contrast, the vectors that CountVectorizer gives us are only telling us about the document (i.e., the review) as a whole; they don't contain any information about the meanings or relationships between individual words.

I'm going to reload the data here, since we no longer need those manually-created token and feature columns, and also because that way we don't have to keep scrolling up to remind ourselves where we're at.  I'll just recreate the `IsRestaurant` column as we did above.

In [None]:
import pandas
import sklearn

df = pandas.read_csv('../Data/reviews_champaign.csv')
df = df[["BusinessName", "Categories", "Date", "Stars", "Text", "Useful", "Funny", "Cool"]]
df['CategoryList'] = df.Categories.str.split(';')
df['IsRestaurant'] = df.CategoryList.map(lambda cats: 'Restaurants' in cats)
df.head()

The feature extraction vectorizers are "transformers" (or "transformations") that take some data as input and give some other data as output.  They're not "models" because they don't make predictions.  We just use them to prepare data for our models.

However, like models, they still need to be trained.  Training a text feature extractor essentially means teaching it what words occur in the data set, and how often.  For instance, the TfidfVectorizer needs to know how many different reviews contain the word *eat* in order to decide how much weight to give it in the transformed result.

Because the vectorizers need training, it is important that, before doing feature extraction, we do our train/test split and **train the feature extraction only on the training set**.  If we trained it on the whole data, we would be cheating, since then the vectorizer would "know" how many times *eat* (and every other word) occurred in the test data as well as the training data, and this might help it make better predictions.
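
Here's a tiny sketch of what "training only on the training set" means in practice.  The documents are made up; the point is just to see what happens to a word that only ever shows up in the test data:

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["the pizza was great", "terrible pizza"]
test_docs = ["great sushi"]        # "sushi" never occurs in the training data

vec = CountVectorizer()
vec.fit(train_docs)                # learn the vocabulary from training data only

print(sorted(vec.vocabulary_))     # the words the vectorizer knows about

test_features = vec.transform(test_docs)
print(test_features.shape)         # one row, one column per *training* word
print(test_features.sum())         # total count of known words in the test doc
```

The vectorizer has no column for *sushi*, so that word is silently ignored when we transform the test document.  That's exactly the honest situation we want: the test data gets described only in terms of things learned from the training data.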

For this example we'll again try to predict whether a review is about a restaurant.  So first we'll do our train/test split.

In [None]:
import sklearn.model_selection
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(df, df.IsRestaurant)

Now we'll create a CountVectorizer:

In [None]:
import sklearn.feature_extraction
vectorizer = sklearn.feature_extraction.text.CountVectorizer(stop_words='english')

Then we train the vectorizer on our training data:

In [None]:
vectorizer.fit(x_train.Text)

The feature extractor is now fitted, so we can use its `.transform` method to get out a gigantic table of all the word counts.  We'll call this method on both the train and test sets, so that the data for both is transformed in the same way.

In [None]:
features_train = vectorizer.transform(x_train.Text)
features_test = vectorizer.transform(x_test.Text)

Dare we peek at what those are?

In [None]:
features_train

Whew!  Mercifully, it doesn't display the whole thing.  Why mercifully?  Because look at what it's telling us: that thing is a matrix with 19746 rows and 31506 columns!  What a monster!

Fortunately, it's stored as a sparse matrix, which is why your computer hasn't crashed.  If every entry in that matrix were stored as a separate value, that would be this many values:

In [None]:
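# rows * columns, read straight off the matrix so the numbers always match
features_train.shape[0] * features_train.shape[1]

Over 600 million!  Since these are 64-bit ints, they take up 8 bytes each.  That would be. . .

In [None]:
# rows * columns * 8 bytes per value
features_train.shape[0] * features_train.shape[1] * 8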

Almost 5 billion bytes, or approximately 4.5 gigabytes.  But, as the output above tells us, only about 1.3 million of those entries have anything in them.  The rest are all zero, because most words don't occur in most reviews.
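
If you want to see the savings for yourself, here's a rough sketch using a small stand-in table (a made-up 1000 by 500 array with only three nonzero entries, not our real features).  It compares the memory a dense array needs with what the sparse version actually stores:

```python
import numpy as np
from scipy import sparse

# A small, mostly-zero table standing in for the real word-count matrix
dense = np.zeros((1000, 500), dtype=np.int64)
dense[0, 3] = 2
dense[10, 42] = 1
dense[999, 499] = 5

sp = sparse.csr_matrix(dense)

print("Dense bytes:  ", dense.nbytes)     # rows * columns * 8
print("Stored values:", sp.nnz)           # only the nonzero entries are kept

# A CSR matrix keeps three internal arrays: the values plus two index arrays
sparse_bytes = sp.data.nbytes + sp.indices.nbytes + sp.indptr.nbytes
print("Sparse bytes: ", sparse_bytes)
```

The sparse version does carry some bookkeeping overhead (it has to remember *where* each nonzero value lives), but when nearly everything is zero that overhead is tiny compared to storing every zero explicitly.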

We can peek at part of one row if we want:

In [None]:
features_train[3,11200:11300].todense()

`.todense()` converts the sparse matrix into a "dense" one --- that is, a normal one where every value is stored separately.  Now that you know this, let us never speak of it again.  **Do not**, I repeat, do not, go around calling `.todense()` on sparse matrices.  If you accidentally call it on a large sparse matrix you may cause your computer to hang as Python inflates the sparse matrix into all its bazillion separate numbers and takes up all your memory.  (Actually, I've tried to choose a small enough data set here that that won't happen, but the principle is sound.)

Unsurprisingly, the data we see is mostly zeros.  But there are a few nonzero entries in there.  What can it all mean?

We can get some help from the `.vocabulary_` attribute of the vectorizer (note it has an underscore at the end):

In [None]:
print(len(vectorizer.vocabulary_))
vectorizer.vocabulary_

It's a gigantic dictionary containing, oddly enough, 31,506 entries --- the same as the number of columns in our sparse matrix.

What this dictionary is telling us is what word each column of the matrix represents.  For instance, we can see which column refers to *food*:

In [None]:
vectorizer.vocabulary_['food']

Then if we look at column 11254 in the giant matrix:

In [None]:
features_train[:10, 11254].todense()

That's showing us how many times *food* appeared in each review.  Note, however, that these numbers don't match up directly with our `df`, because this is only the training set, which is just some random rows from the original data.  These would correspond to the counts of *food* in the first 10 rows of `x_train`.

In [None]:
# put in a number from above to check it
x_train.Text.iloc[0]

So scikit has done just what we were doing by hand: it has counted how many times words occur in each review, only it's done it for 31,000+ different words, all at once.
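
The same lookup is easier to follow on a toy example where we can check the answer by eye.  This sketch uses three made-up documents; it calls `.toarray()` (a close cousin of `.todense()`) only because the matrix is tiny:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["food food food", "nice place", "good food nice staff"]

vec = CountVectorizer()
features = vec.fit_transform(docs)

col = vec.vocabulary_["food"]      # which column holds the counts of "food"?
print("Column for 'food':", col)

# Pull out that one column; converting to dense is safe only for toy data
food_counts = features[:, col].toarray().ravel()
print("Counts of 'food' per document:", food_counts)
```

Each entry in `food_counts` lines up with one document, in order, just as the rows of `features_train` line up with the rows of `x_train`.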

The CountVectorizer has a tokenization process akin to NLTK or Spacy, but it is even more crude: it strips out all punctuation and whitespace completely, and by default it also drops any token that is only one character long.  It also converts everything to lowercase.  Usually this is fine for our purposes.  It is possible to override the tokenization process to get scikit's vectorizers to use your own tokenizer (such as one from NLTK or Spacy).  We've already told the vectorizer to remove stop words (that's what `stop_words='english'` did when we created it); it is also possible to compute counts of ngrams as well as individual words.  We won't discuss these options further here, but you can ask about them and/or look into them if you'd like to play around with them for your project.
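
If you're curious what the tokenization actually does to a piece of text, here's a quick sketch.  The vectorizer's `.build_analyzer()` method hands back the function it uses internally to chop up text, so we can try it on a made-up sentence without fitting anything:

```python
from sklearn.feature_extraction.text import CountVectorizer

text = "Hello, World!  I'm NOT impressed."

# The default analyzer: lowercase, drop punctuation, drop 1-letter tokens
analyze = CountVectorizer().build_analyzer()
tokens = analyze(text)
print(tokens)

# ngram_range=(1, 2) keeps single words and also pairs of adjacent words
bigram_analyze = CountVectorizer(ngram_range=(1, 2)).build_analyzer()
bigram_tokens = bigram_analyze("not good at all")
print(bigram_tokens)
```

Notice that *I'm* vanishes entirely, because the apostrophe splits it into two one-letter pieces.  The bigram version shows why ngrams can matter for sentiment: *not good* becomes its own feature instead of being split into *not* and *good*.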

### Fitting a model again

Now that we have our text features, we can use them as input to a model.  The steps below should look familiar, but note that now we pass the `features_train` as our X value to `.fit()` and `features_test` as our X value to `.predict()`.  Because `features_train` and `features_test` are sparse matrices, we can't easily "add" them to our original `df` as we did with our manually-created columns.  This means we have to keep track of the relationship between rows in `features_train` (and `features_test`) and rows in `x_train`, `y_train`, etc.

In [None]:
from sklearn.linear_model import LogisticRegression

# create the model
lr_model = LogisticRegression()

# fit it on our features
lr_model.fit(features_train, y_train)

# create output table
output = pandas.DataFrame({"Actual": y_test})
output["Dumb"] = True
output["LR"] = lr_model.predict(features_test)

output.head(10)

Now for the moment of truth. . .

In [None]:
for column in ["Dumb", "LR"]:
    print("Scores for model", column)
    print("Classification accuracy:", sklearn.metrics.accuracy_score(output.Actual, output[column]))
    print("F1 score:", sklearn.metrics.f1_score(output.Actual, output[column]))
    print("AUC score:", sklearn.metrics.roc_auc_score(output.Actual, output[column]))
    print("Matthews coefficient:", sklearn.metrics.matthews_corrcoef(output.Actual, output[column]))
    print()

Boom shakalaka.  Now that we have all the word features in there, our model has become quite powerful.  For more than 93% of reviews, it can correctly identify whether the review is about a restaurant or not.
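
It's worth seeing why we print several metrics rather than accuracy alone.  Here's a sketch with made-up labels (three restaurants and one non-restaurant) scoring a "dumb" model that always guesses True:

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, matthews_corrcoef

# Made-up labels: 1 = restaurant, 0 = not a restaurant
actual = [1, 1, 1, 0]
dumb = [1, 1, 1, 1]          # always guess "restaurant"

acc = accuracy_score(actual, dumb)
f1 = f1_score(actual, dumb)
auc = roc_auc_score(actual, dumb)
mcc = matthews_corrcoef(actual, dumb)

print("Classification accuracy:", acc)
print("F1 score:", f1)
print("AUC score:", auc)
print("Matthews coefficient:", mcc)
```

When most of the data belongs to one class, always guessing that class gets decent accuracy and F1 for free.  The AUC and Matthews coefficient are harder to fool: a model that never distinguishes between the classes lands at the "no better than chance" value for both.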

### Re-using our train/test split

We'd now like to proceed to use the same vectorized data to predict star ratings.  But earlier, when we did the train/test split, we passed `df.IsRestaurant` as our Y value.  How can we use the same train/test split if we're not predicting the same thing?  Will we have to redo the whole train/test split thing, which would require us to redo the vectorization?

Fortunately we can avoid that.  The reason is that, when we did our train/test split, we got back `x_train` and `x_test` values that were still DataFrames containing all our original data.  This means that `x_train` actually contains not only the X values, but also the Y value (`IsRestaurant`).  We can see that `x_train.IsRestaurant` is exactly the same as `y_train`:

In [None]:
(x_train.IsRestaurant == y_train).all()

And likewise for the test set:

In [None]:
(x_test.IsRestaurant == y_test).all()

The train/test split is only splitting the *rows* of the data: it just assigns certain reviews as training data and others as test data.  Once we've done that split, it doesn't much matter which *columns* we use as X and Y.  So by keeping all the columns in `x_train` and `x_test`, we retain the flexibility to switch our Y value later.

We also retain the flexibility of changing our X values.  For instance, we could take the vectors from CountVectorizer and add additional columns representing other data about the reviews, such as their funny/cool/useful vote counts, or even the average star rating of that reviewer's previous reviews.  For now, though, we'll continue to just use the vectorized text as our X values.
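
We won't do this with the real data, but here's a rough sketch of how gluing an extra column onto the vectorized text could work.  It uses `scipy.sparse.hstack` on a made-up three-row table; the column name `Useful` mirrors our data, but the values are invented:

```python
import pandas
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for x_train: a text column and one numeric column
toy = pandas.DataFrame({
    "Text": ["great food", "slow service", "great service"],
    "Useful": [2, 0, 5],
})

vec = CountVectorizer()
text_features = vec.fit_transform(toy.Text)      # sparse, one column per word

# Wrap the extra column as a sparse matrix and glue it on the right-hand side
extra = sparse.csr_matrix(toy[["Useful"]].to_numpy())
combined = sparse.hstack([text_features, extra]).tocsr()

print(text_features.shape, "->", combined.shape)
```

The combined matrix can be passed to `.fit()` and `.predict()` just like the text-only one.  The catch is that you'd have to build the test-set features the same way, with the same columns in the same order.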

The upshot of this is that we don't need to redo the train/test split, nor do we need to refit our vectorizer.  All we have to do is, whenever we would be tempted to use `y_train`, we just use `x_train.Stars` instead, because now what we want to predict is stars (not `IsRestaurant`); and whenever we might use `y_test`, we just use `x_test.Stars`.  In fact, in our earlier code, instead of using `y_train`, we could have used `x_train.IsRestaurant`, and so on.
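
Here's the same idea in miniature, on a made-up eight-row table, in case it helps to see it without the real data in the way.  (The `random_state` argument is only there to make the sketch repeatable.)

```python
import pandas
from sklearn.model_selection import train_test_split

toy = pandas.DataFrame({
    "Text": ["a", "b", "c", "d", "e", "f", "g", "h"],
    "Stars": [5, 1, 4, 2, 3, 5, 1, 4],
    "IsRestaurant": [True, False, True, True, False, True, False, True],
})

x_train, x_test, y_train, y_test = train_test_split(
    toy, toy.IsRestaurant, random_state=0)

# The split only picked rows; every column came along for the ride
print(list(x_train.columns))
print((x_train.IsRestaurant == y_train).all())

# So a different target for the very same rows is just another column
stars_train = x_train.Stars
stars_test = x_test.Stars
print(len(stars_train), len(stars_test))
```

Because `stars_train` has exactly the same rows (in the same order) as `y_train`, anything we built from `x_train`, like our vectorized features, still lines up with it.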

With that in mind, the process of creating and fitting our regression model is pretty much exactly the same as before:

In [None]:
# create the model
from sklearn.svm import LinearSVR
svr_model = LinearSVR()

# %time here is just to see how long it takes to fit the model
%time svr_model.fit(features_train, x_train.Stars)

output = pandas.DataFrame({"Actual": x_test.Stars})
output["Dumb"] = x_train.Stars.mean()
output["SVR"] = svr_model.predict(features_test)

# clip values to 1-5 as we did before
output.SVR = output.SVR.clip(1, 5)

output.head(10)

(The alert reader will notice that I changed things up a bit by using a LinearSVR model.  This is a "support vector machine", another type of model available in scikit.)

Let's see how it did. . .

In [None]:
# skip the "Actual" column, which would trivially score zero error against itself
for column in ["Dumb", "SVR"]:
    print("Scores for model", column)
    print("Mean absolute error:", sklearn.metrics.mean_absolute_error(output.Actual, output[column]))
    print("Mean squared error:", sklearn.metrics.mean_squared_error(output.Actual, output[column]))
    print("Root mean squared error:", sklearn.metrics.mean_squared_error(output.Actual, output[column])**0.5)
    print()

We are now able to predict star ratings with an average error of only about 0.9 stars.  Not bad!

## Peeking inside the model again

As before, we can inspect our fitted model to get information about how it's making its decisions.  Like a linear or logistic regression, this SVR model has a `.coef_` attribute that shows us the coefficients:

In [None]:
svr_model.coef_

The number of coefficients here is the same as the number of columns in our giant vectorized-text table:

In [None]:
print("Number of coefficients:", len(svr_model.coef_))
print("Table dimensions:", features_train.shape)

In order to make any sense of these coefficients, we need to match them up with the words they represent.  Remember that our vectorizer has a `.vocabulary_` attribute that shows us the mapping of words to column numbers in the vectorized table.  It also has a method `.get_feature_names_out()` which does the reverse.  (In scikit-learn versions before 1.0 this method was called `.get_feature_names()`; that older name has since been removed.)  It gives us an array containing, in order, the words that correspond to each column in our table:

In [None]:
vectorizer.get_feature_names_out()

As usual, the most convenient thing is to combine the labels and the coefficients into a DataFrame.  We'll make a DataFrame where the words and coefficients are separate columns:

In [None]:
word_coefs = pandas.DataFrame({"Word": vectorizer.get_feature_names_out(), "Coef": svr_model.coef_})
word_coefs.head()

This table lets us see, roughly speaking, the "contribution" to the model's judgement from each individual word.  Also, the numerical index starting at zero, which Pandas automatically creates, handily matches up with the columns in the vectorized table, as we can see by checking the `.vocabulary_`:

In [None]:
vectorizer.vocabulary_['00am']

One of the most useful things to do with this is to sort it by `Coef`:

In [None]:
word_coefs.sort_values("Coef")

This now shows us which words move the score up or down the most.  Some are not surprising, but others may be very surprising!  One reason we may see some odd things in here is that infrequent words can have an outsized effect on the model.  For instance, if there is only one restaurant in town offering birria, but it's a really good restaurant, the word "birria" may wind up being strongly associated with positive reviews.

One way to filter out these kinds of things is to incorporate into our table the total frequency of each word.  We can get this by summing the columns of our `features_train` table.  (Sparse matrices still allow us to do things like column sums without converting to a dense matrix.)  The `axis=0` here tells `sum` to add up the values down each column, giving one total per word, rather than across each row.
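
Since the `axis` business is easy to get backwards, here's a small sketch on made-up documents showing both directions.  It flattens the result with `np.asarray(...).ravel()`, which is one way (not the only way) to turn the result of a sparse sum into a plain one-dimensional array:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["food food food", "nice place", "good food nice staff"]
vec = CountVectorizer()
features = vec.fit_transform(docs)

col_totals = features.sum(axis=0)    # one total per column, i.e. per word
row_totals = features.sum(axis=1)    # one total per row, i.e. per document

# Flatten into ordinary 1-D arrays
word_freq = np.asarray(col_totals).ravel()
doc_length = np.asarray(row_totals).ravel()

# List the words in column order so they line up with word_freq
words = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(dict(zip(words, word_freq)))
print(doc_length)
```

Summing with `axis=0` answers "how often does each word occur overall?", while `axis=1` answers "how many counted words does each document have?"  For word frequencies we want the first one.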

In [None]:
features_train.sum(axis=0)

This value is actually not an "array" but a "matrix", which is an old datatype from numpy that may be phased out in the future.  We won't go into the details here but we can convert this to an array and get its first (and only) row like this:

In [None]:
features_train.sum(axis=0).A[0, :]

(The `.A` tells numpy to convert it to an array, and `[0, :]` means "get row 0 and all the columns".)  Now that we know how to get these values, we can stick them into our coefficient DataFrame:

In [None]:
word_coefs['Freq'] = features_train.sum(axis=0).A[0, :]
word_coefs.head()

If we now sort again...

In [None]:
word_coefs.sort_values("Coef")

We can see that most of the most extreme coefficients are for words that occur fairly infrequently.  (Remember this is trained on about 20,000 reviews, so a word that occurs even, say, 100 times is pretty infrequent.)

Let's get a picture of the overall frequency distribution:

In [None]:
word_coefs.Freq.describe()

As usual with language data, there is a highly skewed distribution: 75% of words occur 9 times or fewer, but the most frequent word occurs more than 10,000 times.  We can see the most frequent words by sorting on frequency:

In [None]:
word_coefs.sort_values('Freq', ascending=False).head(10)

Let's try looking at only words that occur 10 times or more.  This represents the top 25% of most frequent words.  We'll select those rows from our table and then sort by Coef:

In [None]:
word_coefs[word_coefs.Freq >= 10].sort_values("Coef")

We see more informative results here, but still some of what appears to be "noise".  Just for fun, let's try restricting to only words that occur 100 times or more:

In [None]:
word_coefs[word_coefs.Freq >= 100].sort_values("Coef")

Here we really start to see results that we might consider meaningful.  Most of the most negative and positive words here are plausible.

This gives us some insight into how the model is working.  The overall prediction for an individual review will be based on all words it contains, both high and low frequency (apart from stop words, if we filter them out as we've done here).  This will include both common words with clear individual meanings (like *best*) as well as infrequent words whose influence on the model may be more idiosyncratic (like *yakidori*).  As we can see from the overall accuracy, the model is pretty good at integrating this information to come up with a decent overall judgement.  However, this also gives us some ideas for how we might try to improve the model, for instance by trying to reduce "noise" from words whose informativeness is due more to their rarity than to their meaning.

## Predicting scores for individual reviews

Let's now take a quick look at how we would use our model to predict the score for a particular review.  First we'll look at how to predict a particular review that's already in our table.  Let's grab a random one from our test set:

In [None]:
random_review = x_test.sample()
random_review

To predict the score for this individual review, we need to apply the vectorizer to the review text:

In [None]:
review_text = vectorizer.transform(random_review.Text)
review_text

Notice that what we now have is a sparse matrix with the same number of columns as `features_train` and `features_test`, but only one row.  Now we can just call `.predict()` on this:

In [None]:
svr_model.predict(review_text)

The model predicts a score of ### stars for this review.  It actually got ### stars.  (Substitute the real numbers here; remember this will change each time we run this, since we're picking a random review each time.)

If we wanted to predict a category using our earlier model, we could do the same with `lr_model.predict(review_text)`.  This works only as long as `lr_model` was fitted on features from this same fitted vectorizer; if we redo the train/test split or refit the vectorizer in between, the feature columns may no longer line up with what the model learned.

Now suppose instead of grabbing a review from our table, we wanted to predict the score on a new review we got somehow.  This is the situation you might face if you were using a sentiment analysis system in the real world.  For instance, you might train your model on some database, but then you want it to monitor Twitter or whatever and tell you the sentiment of new statements as they're made.  So suppose we got this review text:

In [None]:
new_review = """OMMMMMGGGGG u have 2 try Ling 111 that class is the best!!!!!  we learned about sentimental analysises and so much more!!
homework can be tough tho u better know ur python or itll be a disaster
def reecomend!"""

If we try to use our vectorizer on this. . .

In [None]:
vectorizer.transform(new_review)

It didn't work.  Notice the error message.  The thing is that the vectorizer isn't expecting a *single* document.  It expects an iterable of documents, in other words a sequence of documents, not just one document.  We can make this happen by making a list containing just our one review:

In [None]:
review_text = vectorizer.transform([new_review])
review_text

Now it worked, and we can again predict the score:

In [None]:
svr_model.predict(review_text)

In practice we'd want to create a function or pipeline that made sure to integrate this vectorization and prediction process with any pre- or post-processing we wanted to do (like clipping the scores, which wasn't done here).  However, even so, the model's prediction is fairly reasonable.
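
Here's a rough sketch of what such a function might look like.  It trains on a tiny made-up data set just so the example runs on its own, and `predict_stars` is a hypothetical helper name, not something scikit provides.  It uses scikit's `make_pipeline` to bundle the vectorizer and model together:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVR

# Tiny made-up training set, just enough to make the sketch run
train_texts = ["great food loved it", "awful service never again",
               "great service", "awful food", "okay place"]
train_stars = [5, 1, 5, 1, 3]

# The pipeline bundles vectorizing and predicting into a single object
pipeline = make_pipeline(CountVectorizer(),
                         LinearSVR(random_state=0, max_iter=10000))
pipeline.fit(train_texts, train_stars)

def predict_stars(texts):
    """Hypothetical helper: predict clipped star ratings for one or more reviews."""
    if isinstance(texts, str):
        texts = [texts]               # the vectorizer wants a list of documents
    raw = pipeline.predict(texts)
    return np.clip(raw, 1, 5)         # same cleanup we used when evaluating

single = predict_stars("great food")
several = predict_stars(["awful awful awful", "great great great great"])
print(single)
print(several)
```

The point of wrapping things up like this is that nobody (including future you) can forget the list-wrapping or the clipping, because they happen every time the function is called.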

If we had multiple reviews we could do the same.  We'll add a newer review to our small "data set":

In [None]:
newer_review = """Well crap that was a nightmare.  This prof has no clue what he is doing.  He kept talking about snakes and other animals.  WTF?
Your time would be better spent just googling things.  Why bother to predict yelp reviews when you can just eat at the restaurant and see for yourself?
Wish I could give negative stars for this monstrosity."""

In [None]:
review_text = vectorizer.transform([new_review, newer_review])
review_text

In [None]:
svr_model.predict(review_text)

## Conclusion

As mentioned, I used a LinearSVR model here rather than the types I used in the first part of the tour.  One thing you'll find is that some kinds of models do not do well with the large numbers of features created by these text vectorizers; they may take a long time to run.  There are several ways to handle that.  One is to try different models.

Another approach is to tweak the vectorizer.  The `CountVectorizer` and `TfidfVectorizer` classes can be passed certain extra arguments that allow you to do things like remove stopwords, set a maximum number of word features, and/or set thresholds on word features (for instance, excluding words that occur in more than X% of all documents, where X is a number you set, which can help to filter out common words).  You're encouraged to read [the documentation](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text) to see how to use them.
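
As a taste of what those arguments do, here's a small sketch on four made-up documents using `min_df`, `max_df`, and `max_features`.  The particular threshold values are arbitrary choices for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the food was great", "the service was slow",
        "the pizza was great", "the birria was amazing"]

# min_df=2: drop words found in fewer than 2 documents
# max_df=0.9: drop words found in more than 90% of documents
vec = CountVectorizer(min_df=2, max_df=0.9)
vec.fit(docs)
kept = sorted(vec.vocabulary_)
print(kept)

# max_features=3: keep only the 3 most frequent words overall
small_vec = CountVectorizer(max_features=3)
small_vec.fit(docs)
top_words = sorted(small_vec.vocabulary_)
print(top_words)
```

In the first vectorizer, *the* and *was* get thrown out for being everywhere, and the one-off words get thrown out for being too rare.  In the second, notice that a frequency cap by itself tends to keep exactly the boring common words, which is why it's usually combined with stop word removal.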

In addition to playing around with the vectorizers, you can play around with the models themselves.  Scikit has so many kinds of models that we can't come close to addressing how to decide between them in any given situation, although [the flowchart](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html) is a good start.  Each type of model has different parameters you can tweak.  Sometimes they may not seem to have any effect.  Other times they may noticeably improve the model.  Sometimes you may do something that makes it way worse.  Such is life.

Improving machine-learning models is an iterative process.  It involves tweaking things again and again to try to squeeze out a little bit of extra performance, while always using a train/test split and/or cross-validation to keep yourself honest.  (It is far too easy to fool yourself by tweaking things until you get good results on one particular test set!)
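
We haven't covered cross-validation in detail, but here's a minimal sketch of what it can look like, using scikit's `cross_val_score` on a dozen made-up reviews.  The data and the choice of three folds are just assumptions for the sake of a runnable example:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Made-up data: restaurant-ish reviews vs. everything else
texts = ["great pizza", "tasty burger", "delicious food", "good pasta",
         "yummy tacos", "fresh sushi", "slow oil change", "nice haircut",
         "friendly dentist", "clean hotel room", "helpful mechanic",
         "quick car wash"]
is_restaurant = [True] * 6 + [False] * 6

# With the vectorizer inside the pipeline, it is refit on the training
# portion of every fold, so no fold's test data leaks into the vocabulary
pipeline = make_pipeline(CountVectorizer(), LogisticRegression())

# cv=3: split the data three ways and take turns holding out each part
scores = cross_val_score(pipeline, texts, is_restaurant, cv=3)
print(scores)
print("Mean accuracy:", scores.mean())
```

Instead of one score from one test set, you get one score per fold.  If a tweak only helps on one of them, that's a hint you may be chasing noise rather than making a real improvement.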

For the assignment in this section of the class, you'll just have to do some basic things along the lines we discussed here.  This material, however, is fertile ground for projects.  Just about any prediction you can dream up from the data can be fleshed out into a machine-learning pipeline that follows the same general pattern we've seen here.  And of course, there's more than just Yelp reviews!