# NLP: scikit-learn basics

![Creative Commons License](https://i.creativecommons.org/l/by/4.0/88x31.png)  
This work by Jephian Lin is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).

In [None]:
import numpy as np
import pandas as pd

### scikit-learn API

The scikit-learn (or sklearn) package provides many ready-to-use models and datasets.  The algorithms are implemented in their basic form without many heuristic modifications, which makes them good for teaching.  Most importantly, it has a consistent API: the workflow for using each model is the same, as follows.  

```python
# X is the data, an array of shape (n_samples, n_features)
from sklearn.decomposition import PCA  # import the model
model = PCA(n_components=2)  # select the model and set its hyperparameters
X_new = model.fit_transform(X)  # fit (train) the model, then transform the data (or predict)
```

**Terminology**:  

- data: a collection of information, which can be numbers, text, pictures, audio, etc.
- label: the answer corresponding to each sample, e.g., the score of a product review
- model: a function that can be used for transformation or prediction, usually created by an algorithm run on data
- fit or train: use the data to create a model
- transformation: use data to create new data
- prediction: use data to guess the answer
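
As a minimal sketch of this workflow, the example below runs one transformation (PCA) and one prediction ($k$-means) on a small made-up array.  The array `X_toy` and the other variable names are assumptions for illustration only; they are not part of the data used later.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Hypothetical toy data: 6 samples with 3 features each,
# forming two well-separated groups
X_toy = np.array([
    [0.0, 0.1, 0.0],
    [0.2, 0.0, 0.1],
    [0.1, 0.2, 0.0],
    [5.0, 5.1, 4.9],
    [5.2, 4.9, 5.0],
    [4.9, 5.0, 5.1],
])

# Transformation: fit the model, then use the data to create new data
pca = PCA(n_components=2)
X_toy_new = pca.fit_transform(X_toy)
print(X_toy_new.shape)  # 6 samples, now with 2 features each

# Prediction: fit the model, then guess a label for each sample
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
y_toy = kmeans.fit_predict(X_toy)
print(y_toy)  # one cluster label per sample
```

Both models follow the same pattern: create the model with its hyperparameters, then call a `fit_...` method on the data.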

We will use the following data to demonstrate two algorithms, PCA and $k$-means.

In [None]:
import nltk
nltk.download('inaugural')

from nltk.corpus import inaugural

files = inaugural.fileids()
texts = [inaugural.raw(file) for file in files]
years = [file[:-4].split("-")[0] for file in files]
presidents = [file[:-4].split("-")[1] for file in files]
df = pd.DataFrame({
    "year": years,
    "president": presidents,
    "file": files,
    "text": texts
})
df.set_index("year", inplace=True)
df.tail()  # show the last few rows

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')
X = tfidf.fit_transform(texts).toarray()
print(X.shape)

### Principal component analysis (PCA)

In our data, each sample is a vector of 8984 entries (the second number in `X.shape` above; the exact value may differ with the version of the corpus).  We usually think of it as a point in an 8984-dimensional space.  However, it is almost impossible for us to visualize such a high-dimensional space.  We will transform the data into a lower-dimensional space while losing as little information as possible.  

PCA is a dimensionality reduction algorithm: it projects the data onto the directions along which the data varies the most.
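
One way to gauge how much information a PCA transformation keeps is the `explained_variance_ratio_` attribute of a fitted `PCA` model.  The sketch below uses synthetic data, which is an assumption made purely for illustration: 100 points in 5 dimensions that mostly vary along 2 hidden directions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: 2 hidden coordinates embedded into 5 dimensions, plus small noise
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 5))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(100, 5))

pca = PCA(n_components=2)
X_demo_new = pca.fit_transform(X_demo)

print(X_demo_new.shape)                     # 100 samples, 2 features each
print(pca.explained_variance_ratio_)        # fraction of the variance kept by each component
print(pca.explained_variance_ratio_.sum())  # close to 1 here, since the data is almost 2-dimensional
```

On real text data the ratio for two components is typically much smaller, so a 2-dimensional plot should be read as a rough summary rather than a faithful picture.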

In [None]:
from sklearn.decomposition import PCA
model = PCA(2)
X_new = model.fit_transform(X)
X_new.shape

In [None]:
df['x0'] = X_new[:,0]
df['x1'] = X_new[:,1]
df.plot(kind='scatter', x='x0', y='x1', 
        color=df.index.astype(int), hover_data=['president'], 
        backend='plotly')

### $k$-means clustering

When a dataset is given, one may try to partition the data points into several clusters and predict their labels.

The $k$-means clustering algorithm assigns a "reasonable" cluster label to each point: it looks for $k$ centers and labels each point by the center nearest to it.
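
To give a sense of what "reasonable" can mean, here is a minimal sketch of the classic iteration behind $k$-means (often called Lloyd's algorithm): alternate between labeling each point by its nearest center and moving each center to the mean of its points.  The function `kmeans_sketch` and the toy array are hypothetical names for illustration; this is not scikit-learn's implementation, which is more careful about initialization, among other things.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=20, seed=0):
    """Minimal Lloyd-style iteration; a teaching sketch only."""
    rng = np.random.default_rng(seed)
    # Start from k distinct samples chosen at random as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: label each point by its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its points
        for j in range(k):
            if np.any(labels == j):  # keep the old center if a cluster is empty
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Hypothetical toy data: two well-separated groups in the plane
X_toy = np.array([
    [0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
    [5.0, 5.1], [5.2, 4.9], [4.9, 5.0],
])
labels, centers = kmeans_sketch(X_toy, k=2)
print(labels)   # one cluster label per point
print(centers)  # one center per cluster
```

The label numbers themselves carry no meaning; only which points share a label matters.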

_The code below clusters the same data `X` as above and uses the PCA-transformed features only to plot the points._

In [None]:
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3, random_state=0)  # fix the seed so the labels are reproducible
y = model.fit_predict(X)

In [None]:
df['y'] = y.astype('object')
df.plot(kind='scatter', x='x0', y='x1', 
        color='y', hover_data=['president'],  
        backend='plotly')

### Further reading

- [_Python Data Science Handbook_](https://jakevdp.github.io/PythonDataScienceHandbook/) by Jake VanderPlas