# Modeling with Scikit-Learn

In this notebook, we're going to tie together everything we've learned so far to do some modeling with scikit-learn. As we've seen, the violation descriptions contain a large number of relatively free-text fields.

Working with text data is a particularly attractive use case for machine learning. It's also often a messy one that can involve working with a lot of boilerplate code. The library [scikit-learn](http://scikit-learn.org/stable/) provides many features for working with text data.

First, let's take a closer look at scikit-learn.

## Preliminaries: Scikit-Learn

The `scikit-learn` package provides a robust set of machine learning algorithms for Python. Like all of the packages we have seen so far, scikit-learn is built upon the core Python scientific stack (i.e. NumPy, SciPy, Cython). One of the biggest reasons scikit-learn is so popular is that it has a simple, consistent API, making it useful for a wide range of statistical learning applications. The different components of scikit-learn can be combined to make powerful and expressive pipelines for analyzing data.

Scikit-learn provides facilities for

* **supervised learning** algorithms, such as **regression** and **classification**, that learn from a training set with **labels**, or targets, in order to generalize to other inputs
* **unsupervised learning** algorithms, such as **clustering** or **density estimation**, that learn structure in the data from a training set of unlabeled examples
* **dimensionality reduction** algorithms which reduce the number of **features**, or columns, while preserving information about the data
* **model selection** for choosing the best parameters and models
* **preprocessing** for getting data ready to apply machine learning algorithms

### Representing Data in Scikit-Learn

Most machine learning algorithms implemented in scikit-learn expect data to be stored in a two-dimensional array or matrix. The arrays can be either **numpy** arrays or, in some cases, **scipy.sparse** matrices. The shape of the array is expected to be `[n_samples, n_features]`:

* **n_samples**: The number of samples: each sample is an item to process (e.g. classify). A sample can be a document, a picture, a sound, a video, a row in database or CSV file, or whatever you can describe with a fixed set of quantitative traits.

* **n_features**: The number of features or distinct traits that can be used to describe each item in a quantitative manner. Features are generally real-valued, but may be boolean or discrete-valued in some cases.

The number of features must be fixed in advance. However, the data can be very high dimensional (e.g. millions of features), with most of the features being zero for a given sample. This is a case where `scipy.sparse` matrices are useful, because they store only the non-zero entries and are therefore much more memory-efficient than numpy arrays.
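
To make the memory argument concrete, here is a minimal sketch that stores the same mostly-zero `[n_samples, n_features]` data as a dense NumPy array and as a SciPy CSR matrix. The array sizes and the names `dense` and `sparse_X` are made up for illustration.

```python
import numpy as np
from scipy import sparse

# A toy [n_samples, n_features] array that is almost entirely zeros.
dense = np.zeros((1000, 5000))
dense[0, 10] = 1.0
dense[42, 4999] = 3.0

# The CSR format keeps only the non-zero values and their positions.
sparse_X = sparse.csr_matrix(dense)

# Bytes used by the three arrays that make up the CSR representation.
sparse_bytes = (
    sparse_X.data.nbytes + sparse_X.indices.nbytes + sparse_X.indptr.nbytes
)

print(sparse_X.shape)  # same logical shape as the dense array
print(sparse_X.nnz)    # number of stored non-zero entries
print(dense.nbytes, sparse_bytes)
```

The logical shape is unchanged; only the storage differs.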

### Aside: NumPy Arrays

We haven't talked much about NumPy arrays. NumPy arrays, however, are the fundamental data structure in the Python data stack.

A NumPy array is an object that represents a homogeneously typed, multidimensional array. The array provides an efficient (close to the hardware) data structure for scientific, or array-oriented, computing. First, let's look at the NumPy import convention.

In [None]:
import numpy as np

And create an array from a Python list. This is an array of all integers.

In [None]:
x = np.array([1, 2, 3, 4, 5])

We can perform indexing operations much like we saw with pandas earlier, but without the convenience of labels.

You can use regular Python slicing syntax.

In [None]:
x[:3]

In [None]:
x[::2]

Or what's called **fancy indexing** by using Boolean or integer indexes.

In [None]:
x[[True, False, True, False, True]]
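
The text above mentions both Boolean and integer indexes. As a small sketch, the two styles can be compared side by side; the names `mask`, `odd`, and `picked` are illustrative only.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])

# Boolean mask: keep the positions where the mask is True.
mask = x % 2 == 1
odd = x[mask]

# Integer indexes: pick the elements at positions 0, 2 and 4, in that order.
picked = x[[0, 2, 4]]

print(odd)
print(picked)
```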

We can perform operations on NumPy arrays like `sum`.

In [None]:
np.array([1, 2, 3, 4, 5]).sum()

And we can perform linear algebra operations, like taking the dot product.

In [None]:
x = np.array([[1, 2, 3], 
              [4, 5, 6],
              [7, 8, 9]])

x

In [None]:
y = np.array([[4], [5], [6]])

y

In [None]:
x.dot(y)

NumPy and the SciPy libraries provide much more than data structures. They also include facilities for linear algebra, matrix decompositions, optimization, clustering, polynomials, unit testing, etc.

### Scikit-Learn Quickstart

Let's take a quick look at scikit-learn to fix ideas before going much further. We'll have a look at the canonical iris dataset, which consists of a set of measurements for flowers, each being a member of one of three species: Iris Setosa, Iris Versicolor or Iris Virginica.

In [None]:
from sklearn.datasets import load_iris

In [None]:
dta = load_iris()

The features of the data consist of

In [None]:
dta.feature_names

The labels consist of

In [None]:
dta.target_names

In [None]:
dta.data[:10]

In [None]:
dta.target[::10]

Let's fit a logistic regression model on all of the iris data.

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
model = LogisticRegression()

In [None]:
model.fit(dta.data, dta.target)

In [None]:
model.predict(dta.data)

### `scikit-learn` interface

The power of scikit-learn comes from the fact that its estimators share a common, unified API, consisting of three complementary interfaces:

* **estimator** interface for building and fitting models
* **predictor** interface for making predictions
* **transformer** interface for converting data.

The estimator interface is at the core of the library. It defines instantiation mechanisms of objects and exposes a `fit` method for learning a model from training data. All supervised and unsupervised learning algorithms (*e.g.*, for classification, regression or clustering) are offered as objects implementing this interface. Machine learning tasks like feature extraction, feature selection or dimensionality reduction are also provided as estimators.

Scikit-learn strives to have a uniform interface across all methods. For example, a typical **estimator** follows this template:

```python
class Estimator:
  
    def fit(self, X, y=None):
        """Fit model to data X (and y)"""
        self.some_attribute = self.some_fitting_method(X, y)
        return self
            
    def predict(self, X_test):
        """Make prediction based on passed features"""
        pred = self.make_prediction(X_test)
        return pred
```

For a given scikit-learn estimator object named model, several methods are available. Irrespective of the type of estimator, there will be a fit method:

* model.fit : fit training data. For supervised learning applications, this accepts two arguments: the data X and the labels y (e.g. model.fit(X, y)). For unsupervised learning applications, this accepts only a single argument, the data X (e.g. model.fit(X)).

> During the fitting process, the state of the **estimator** is stored in attributes of the estimator instance named with a trailing underscore character (_). For example, an ensemble such as `sklearn.ensemble.RandomForestRegressor` stores its fitted regression trees (instances of `sklearn.tree.DecisionTreeRegressor`) in the `estimators_` attribute.
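
As a brief sketch of the trailing-underscore convention, the snippet below checks for a learned attribute before and after fitting a logistic regression on the iris data. The variable names and the `max_iter=1000` setting (chosen only to avoid a convergence warning) are assumptions of this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
clf = LogisticRegression(max_iter=1000)

# Before fitting, learned attributes such as coef_ do not exist yet.
has_coef_before = hasattr(clf, "coef_")

clf.fit(iris.data, iris.target)

# After fitting, coef_ holds one row per class and one column per feature.
has_coef_after = hasattr(clf, "coef_")

print(has_coef_before, has_coef_after)
print(clf.coef_.shape)
print(clf.classes_)
```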

The **predictor** interface extends the notion of an estimator by adding a predict method that takes an array X_test and produces predictions based on the learned parameters of the estimator. In the case of supervised learning estimators, this method typically returns the predicted labels or values computed by the model. Some unsupervised learning estimators may also implement the predict interface, such as k-means, where the predicted values are the cluster labels.

All **supervised estimators** are expected to have the following methods:

* `model.predict` : given a trained model, predict the label of a new set of data. This method accepts one argument, the new data X_new (e.g. model.predict(X_new)), and returns the learned label for each object in the array.
* `model.predict_proba` : For classification problems, some estimators also provide this method, which returns the probability that a new observation has each categorical label. In this case, the label with the highest probability is returned by model.predict().
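* `model.score` : for classification or regression problems, most estimators implement a score method, with a larger score indicating a better fit. For classifiers the default score is the mean accuracy, which lies between 0 and 1. For regressors it is the coefficient of determination, $R^2$, which is at most 1 but can be negative for a model that fits poorly.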
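
The three methods above can be sketched on the iris data as follows. This is illustrative only: the model is scored on the same data it was trained on, so the accuracy is optimistic, and `max_iter=1000` is an assumption made to avoid a convergence warning.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
clf = LogisticRegression(max_iter=1000).fit(iris.data, iris.target)

X_new = iris.data[:5]

# Predicted class label for each row of X_new.
labels = clf.predict(X_new)

# One row per sample, one column per class; each row sums to 1.
proba = clf.predict_proba(X_new)

# The predicted label is the class with the highest probability.
same = np.array_equal(clf.classes_[proba.argmax(axis=1)], labels)

# Mean accuracy on the given data and labels.
accuracy = clf.score(iris.data, iris.target)

print(labels)
print(proba.shape, same)
print(accuracy)
```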

Since it is common to modify or filter data before feeding it to a learning algorithm, some estimators in the library implement a **transformer** interface which defines a transform method. It takes as input some new data `X_test` and yields as output a transformed version. Preprocessing, feature selection, feature extraction and dimensionality reduction algorithms are all provided as transformers within the library.

Transformers, which include many **unsupervised estimators**, provide these methods:

* `model.transform` : given a fitted model, transform new data into the new basis. This also accepts one argument, X_new, and returns the new representation of the data based on the fitted model.
* `model.fit_transform` : some estimators implement this method, which more efficiently performs a fit and a transform on the same input data.
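
As a minimal sketch of the transformer methods, the example below uses `StandardScaler` on a tiny made-up array. The data and variable names are illustrative and not part of the inspection dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_new = np.array([[2.0, 20.0]])

# fit learns the per-column mean and scale from the training data.
scaler = StandardScaler()
scaler.fit(X_train)

# transform applies the learned statistics to new data.
X_new_scaled = scaler.transform(X_new)

# fit_transform does both steps on the same input in one call.
X_train_scaled = StandardScaler().fit_transform(X_train)

print(scaler.mean_)
print(X_new_scaled)
print(X_train_scaled)
```

Because `X_new` equals the column means of `X_train`, it is mapped to zeros.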

Let's take a look at some examples of each of these using the Chicago Health Inspection data.

In [None]:
def float_to_zip(zip_code):
    # convert from the string in the file to a float
    try:
        zip_code = float(zip_code)
    except ValueError:  # some of them are empty
        return np.nan
    
    # 0 makes sure to left-pad with zero
    # zip codes have 5 digits
    # .0 means, we don't want anything after the decimal
    # f is for float
    zip_code = "{:05.0f}".format(zip_code)
    return zip_code

In [None]:
import pandas as pd

dta = pd.read_csv(
    "data/health_inspection_chi.csv",
    index_col='inspection_id',
    parse_dates=['inspection_date'],
    converters={
        'zip': float_to_zip
    },
    usecols=lambda col: col != 'location'
)

In [None]:
dta = dta.loc[~dta.violations.isnull()]

## Bag-of-Words

First, we need to take our text and turn it into numerical features. A common assumption for doing machine learning on text is what's known as the bag-of-words assumption. This means that we assume that the order in which words occur in a document doesn't matter for discerning the general meaning of the document. Building a bag-of-words representation is commonly done in the following steps:

1. Build what's called a **vocabulary**, which is a mapping between integers and the possible words, $w$, in your corpus, or collection of documents.
2. Using this vocabulary, count the number of times each word occurs in each document.

What you're left with is a matrix $X$, where each value $X[i,j]$ is the count of word $j$ in document $i$.
$X$ is a matrix of dimension n_documents by n_vocabulary, which can be very large. Luckily, most words occur in only a few documents, so most entries of $X$ are zero. (If every word occurred in every document, we would not be able to separate the documents according to topics.)

For this reason, bag-of-words representations are typically high-dimensional, sparse datasets, and we don't need to keep the zeros in memory.
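
The two steps above can be sketched by hand on a made-up three-document corpus. Here the vocabulary maps each word to a column index, and splitting on whitespace stands in for a real tokenizer; both are simplifying assumptions of this example.

```python
from collections import Counter

import numpy as np

toy_docs = ["food not covered", "floor not clean", "clean food area"]
toy_tokens = [doc.split() for doc in toy_docs]

# Step 1: build the vocabulary, mapping each distinct word to a column index.
all_words = sorted({word for tokens in toy_tokens for word in tokens})
vocabulary = {word: j for j, word in enumerate(all_words)}

# Step 2: count how often each word occurs in each document.
toy_counts = np.zeros((len(toy_docs), len(vocabulary)), dtype=int)
for i, tokens in enumerate(toy_tokens):
    for word, count in Counter(tokens).items():
        toy_counts[i, vocabulary[word]] = count

print(vocabulary)
print(toy_counts)
```

Even in this tiny example, half of the entries are zero.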

## Tokenize

Ok, so how do we do this? Text is often really messy: it has punctuation, and it has a bunch of words that every text has to have but that don't necessarily connote topical meaning. These are called stop words; examples are "the," "a," and "an."

We turn human writing into a set of feature vectors by taking care of these issues. The step of splitting text into individual words, or tokens, is called tokenization.

scikit-learn provides some nice facilities for building a dictionary of features and transforming documents to feature vectors. The first of these that we will look at is the **CountVectorizer** transformer.

Recall from above that a transformer is an estimator that provides a transform method.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In scikit-learn, all of the estimators take their options when you instantiate the estimator. Here, we say that we want to remove stop words using a built-in list of common English words that we won't need.

In [None]:
count_vectorizer = CountVectorizer(stop_words='english')

Then we need to *fit* the transformer on the data. Calling fit always returns the object itself. We'll see why later.

Calling a `fit` method on an estimator actually does the learning. Any learned parameters are now attached to the estimator as attributes whose names end in an underscore (`_`).

In [None]:
count_vectorizer.fit(dta.violations)

In the case of `CountVectorizer` this is a dictionary called `vocabulary_` which stores a mapping from the known vocabulary to the column in the sparse matrix which contains the counts for that word. 

In [None]:
len(count_vectorizer.vocabulary_)

Finally, we need to transform our original data using transform.

In [None]:
count_matrix = count_vectorizer.transform(dta.violations)

`count_matrix` is a **sparse matrix**, provided by the SciPy library. The number of rows (samples) is equal to the number of violation descriptions that we have in the data. The number of columns is the cardinality of our vocabulary. The entries are the counts of each word in each document.

In [None]:
count_matrix
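
To see the pieces on a scale that fits on screen, here is a small sketch that runs `CountVectorizer` on three made-up sentences. The sentences and the `toy_` names are illustrative and not part of the inspection data.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = [
    "The food was not covered",
    "The floor was not clean",
    "Clean the food area",
]

toy_vectorizer = CountVectorizer(stop_words="english")

# fit_transform learns the vocabulary and returns the sparse count matrix.
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# Mapping from each kept (lowercased) word to its column in the matrix.
print(toy_vectorizer.vocabulary_)

# Dense view, only sensible for tiny examples like this one.
print(toy_counts.toarray())
```

Stop words such as "the" do not appear in `vocabulary_`, and the matrix has one row per document and one column per vocabulary entry.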

Sparse matrices behave a lot like plain numpy arrays. For example, we can ask for the total count of each word over all the documents.

In [None]:
count_matrix.sum(0)

We might ask, what is the most frequent word?

In [None]:
inverse_vocabulary = {v: k for k, v in count_vectorizer.vocabulary_.items()}

In [None]:
inverse_vocabulary[count_matrix.sum(0).argmax()]

This is unsurprising, since almost every violation contains the word "comments."

## Tf-Idf

Looking at the most common word we already see one issue with using raw counts. Another issue is that longer documents will have higher counts of words. Commonly, we use a technique known as **term frequency - inverse document frequency**, or **tf-idf**, instead of counts to do analysis on text data, which mitigates these issues.

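The *term frequency* is a measure of the frequency of a word in a document. Writing $w_{ij}$ for the count of word $j$ in document $i$, the term frequency in document $i$ for word $j$ is

$$tf_{ij}=\frac{w_{ij}}{\sum_k w_{ik}}$$

You could compute this by dividing each row of the count matrix by its row sum.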
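
One way to compute the term frequency without leaving the sparse format is sketched below on a tiny made-up count matrix. It assumes every document contains at least one counted word, since an empty row would cause a division by zero.

```python
import numpy as np
from scipy import sparse

# Toy count matrix: 2 documents, 3 vocabulary words.
toy_counts = sparse.csr_matrix(np.array([[2, 1, 0],
                                         [0, 1, 3]]))

# Total number of counted words in each document (the row sums).
row_totals = np.asarray(toy_counts.sum(axis=1)).ravel()

# Multiplying by a diagonal matrix of 1 / row_total rescales each row
# while keeping the result sparse.
toy_tf = sparse.diags(1.0 / row_totals) @ toy_counts

print(toy_tf.toarray())
```

Each row of `toy_tf` sums to one, so long and short documents are put on the same scale.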

Another important concept is that of inverse document frequency. This is a measure of how important a word is. Words like stop words or words that are otherwise popular in a corpus will still have a high term frequency. Inverse document frequency is a way to downweight the frequent terms but upweight the rare ones. The inverse document frequency is

$$idf = \log\left(\frac{N_{\text{documents}}}{N_{\text{documents with term}}}\right)$$

or

$$idf = \log\left(\frac{N_{\text{documents}}}{1 + N_{\text{documents with term}}}\right)$$

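The second form adds one to the denominator to avoid dividing by zero when a term in the vocabulary appears in none of the documents, which can happen if your vocabulary is a superset of the words in your documents.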

So tf-idf is

$$\text{tf-idf} = tf \times idf$$

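Scikit-learn actually uses a slightly different definition. With the default settings of `TfidfVectorizer` (`smooth_idf=True`, `norm='l2'`), the term frequency is the raw count, the inverse document frequency is $\ln\left(\frac{1 + N_{\text{documents}}}{1 + N_{\text{documents with term}}}\right) + 1$, and each document's row is rescaled to unit Euclidean length.

Of course, scikit-learn provides a transformer for tf-idf.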
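
As a check on how scikit-learn's definition differs, the sketch below compares the fitted `idf_` attribute with the smoothed formula that `TfidfVectorizer` uses under its default settings (`smooth_idf=True`, `norm='l2'`). The corpus and the `toy_` names are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ["food floor", "food clean", "food food area"]

toy_vect = TfidfVectorizer()
toy_X = toy_vect.fit_transform(toy_docs)

n_docs = len(toy_docs)

# Document frequency: in how many documents does each term appear?
doc_freq = np.bincount(toy_X.nonzero()[1], minlength=toy_X.shape[1])

# Smoothed idf used by scikit-learn's defaults: ln((1 + n) / (1 + df)) + 1.
expected_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# With the default norm, every row has unit Euclidean length.
row_norms = np.linalg.norm(toy_X.toarray(), axis=1)

print(np.allclose(toy_vect.idf_, expected_idf))
print(row_norms)
```

Under this definition a word that occurs in every document still gets an idf of 1 rather than 0, so it is downweighted but not removed.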

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

Let's prepare our TfidfVectorizer. We'll remove stop words, remove any words that don't occur in at least 50 documents, and remove words that occur in more than 85% of the documents.

Finally, we'll use a **regular expression** pattern to determine what exactly a token (or word) is. In this case, we deviate from the scikit-learn default by not allowing numbers to be words.

In [None]:
tfidf_vect = TfidfVectorizer(
    stop_words='english', 
    min_df=50,
    max_df=.85, 
    token_pattern=r"(?u)\b[A-Za-z_][A-Za-z_]+\b"
)
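
To see what the token pattern keeps, it can be tried directly with Python's `re` module. This is a small sketch: the sample sentence is made up, and it is written in lowercase because the vectorizer lowercases text by default before tokenizing.

```python
import re

token_pattern = r"(?u)\b[A-Za-z_][A-Za-z_]+\b"

# Made-up text that mixes words, numbers and letter-digit combinations.
text = "32. food a1 stored at 41f, floor not clean"

# Only runs of two or more letters/underscores bounded by word breaks match.
tokens = re.findall(token_pattern, text)

print(tokens)
```

Pure numbers such as `32` and mixed tokens such as `a1` or `41f` are dropped, as are single-letter words.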

Notice here that we can combine the fitting and the transformation by taking advantage of the `fit_transform` method.

In [None]:
X = tfidf_vect.fit_transform(dta.violations)

In [None]:
X

Using these more restrictive criteria above, we've greatly reduced the dimensionality of the feature space, while ideally preserving the most useful information on the contents of the documents.

## Dimensionality Reduction

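Even after pruning the vocabulary, `X` still has one column per word. **Truncated singular value decomposition (SVD)** projects the data onto a small number of components that capture as much of the variation in the original features as possible. Unlike PCA, scikit-learn's `TruncatedSVD` does not center the data, so it works directly on sparse matrices; applied to tf-idf matrices, it is known as latent semantic analysis (LSA). Like the vectorizers above, `TruncatedSVD` is a transformer, so it provides `fit`, `transform`, and `fit_transform`.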

In [None]:
from sklearn.decomposition import TruncatedSVD

By default, `TruncatedSVD` uses a randomized solver, so we set `random_state` to make the results reproducible.

In [None]:
n_components = 10

svd = TruncatedSVD(
    n_components=n_components, 
    random_state=0
)

Now we project `X` onto the components.

In [None]:
X_reduced = svd.fit_transform(X)
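
As a small sketch of what the projection returns, the example below applies `TruncatedSVD` to a random sparse matrix. The matrix is synthetic and the `toy_` names are illustrative, so only the shapes are meaningful here.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Synthetic sparse data: 50 "documents" by 20 "words".
rng = np.random.RandomState(0)
toy_X = sparse.random(50, 20, density=0.2, format="csr", random_state=rng)

toy_svd = TruncatedSVD(n_components=3, random_state=0)
toy_reduced = toy_svd.fit_transform(toy_X)

# One row per sample, one column per component.
print(toy_reduced.shape)

# One row per component, one column per original feature.
print(toy_svd.components_.shape)

# Share of the variance retained by the kept components.
print(toy_svd.explained_variance_ratio_.sum())
```

The rows of `components_` are what link each reduced dimension back to the original features.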

In [None]:
words = np.array(sorted(tfidf_vect.vocabulary_.keys()))

In [None]:
words[:15]

Let's look at the top words in each dimension.

In [None]:
for i in range(n_components):
    idx = svd.components_[i].argsort()[::-1][:6]
    
    top_k = words[idx]
    print("{i}: {words}".format(i=i, words=top_k))

## Clustering

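k-means uses Euclidean distance, which is sensitive to the length of each vector. We rescale each row of `X_reduced` to unit length first, so that distances between documents depend only on the direction of their vectors. On unit vectors, squared Euclidean distance is a monotone function of cosine similarity.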

In [None]:
from sklearn.preprocessing import Normalizer

normalizer = Normalizer(copy=True)

In [None]:
X_norm = normalizer.fit_transform(X_reduced)

In [None]:
np.linalg.norm(X_norm, axis=1)
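
The reason unit-length rows help k-means can be checked numerically: for unit vectors $a$ and $b$, $\lVert a - b \rVert^2 = 2 - 2\,a \cdot b$, so Euclidean distance and cosine similarity rank pairs of points the same way. A minimal sketch on random data follows; the arrays are synthetic and the names illustrative.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

rng = np.random.RandomState(0)
toy = rng.rand(4, 3)

# Rescale every row to unit Euclidean length.
unit = Normalizer().fit_transform(toy)

a, b = unit[0], unit[1]

sq_dist = np.sum((a - b) ** 2)   # squared Euclidean distance
cosine = a @ b                   # cosine similarity of unit vectors

print(np.linalg.norm(unit, axis=1))
print(np.isclose(sq_dist, 2 - 2 * cosine))
```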

In [None]:
from sklearn.cluster import KMeans

n_clusters = 20

kmeans = KMeans(n_clusters=n_clusters, random_state=0)

kmeans.fit(X_norm)

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

ax.hist(kmeans.labels_, bins=n_clusters);

In [None]:
dta.violations[kmeans.labels_ == 0].iloc[0]

In [None]:
dta.violations[kmeans.labels_ == 0].iloc[1]

## Visualizing Clusters

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
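from sklearn.manifold import TSNE

# `tsne` is not defined earlier in this notebook. As an assumption, fit a
# 2-D t-SNE embedding of the normalized data (this can be slow on large data).
tsne = TSNE(n_components=2, random_state=0)
tsne.fit(X_norm)

palette = np.array(sns.color_palette("hls", n_clusters))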

fig, ax = plt.subplots(figsize=(12, 8))

ax.scatter(
    tsne.embedding_[:, 0],
    tsne.embedding_[:, 1],
    lw=0,
    s=40,
    c=palette[kmeans.labels_]
)

In [None]:
# toarray() is the explicit spelling of the older .A shorthand; it makes a dense copy
words = pd.DataFrame(X.toarray(), columns=sorted(tfidf_vect.vocabulary_.keys()))

The comments are free text.

In [None]:
words.columns

In [None]:
# the ten words with the highest mean tf-idf weight in cluster 0
words.groupby(kmeans.labels_).get_group(0).mean().nlargest(10)