# NLP: scikit-learn basics

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/phonchi/ModularPython/blob/master/NLP-scikit_learn_basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/phonchi/ModularPython/blob/master/NLP-scikit_learn_basics.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

![Creative Commons License](https://i.creativecommons.org/l/by/4.0/88x31.png)  
This work by Jephian Lin is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).

In [1]:
import numpy as np
import pandas as pd

### Scikit-learn API

### Utilizing Scikit-learn for Machine Learning Models

The `scikit-learn` (or `sklearn`) library is a powerful tool for machine learning, offering a wide range of ready-to-use models and datasets. The algorithms are implemented in a straightforward manner with minimal heuristic modifications, making it ideal for educational purposes. One of the key strengths of `scikit-learn` is its consistent API, which simplifies the process of using different models. The typical workflow for utilizing a model in `scikit-learn` can be summarized as follows:

```python
from sklearn.decomposition import PCA  # Import the model
model = PCA(n_components=2)            # Initialize the model with hyperparameters
# X is assumed to be an (n_samples, n_features) array of data
X_new = model.fit_transform(X)         # Fit the model to the data and apply the transformation
```

**Terminology**:
- **Data:** A collection of information that may be numerical, textual, visual, or auditory.
- **Label:** The target output for each data sample, such as a product review score.
- **Model:** A mathematical function used for making predictions or transformations, derived from applying an algorithm to data.
- **Fit or Train:** The process of adjusting the model's parameters based on the data.
- **Transformation:** The conversion of data into a new format or structure using the model.
- **Prediction:** Estimating outputs based on input data using the model.
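
The terms above map directly onto method names in the `scikit-learn` API. The following is a minimal sketch on made-up toy data. The names `X_toy`, `y_toy`, `scaler`, and `clf` are illustrative only, and `StandardScaler` and `LogisticRegression` are picked here simply as one transformer and one predictor; they are not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Toy data (made up for illustration): 6 samples, 2 features, binary labels
X_toy = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3],
                  [3.0, 3.1], [3.2, 2.9], [2.9, 3.3]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# Transformation: fit learns each feature's mean and scale, transform applies them
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_toy)

# Prediction: fit adjusts the model's parameters, predict estimates the labels
clf = LogisticRegression()
clf.fit(X_scaled, y_toy)
y_pred = clf.predict(X_scaled)

print(X_scaled.shape)
print(y_pred)
```

Transformers expose `fit`, `transform`, and `fit_transform`; predictors expose `fit` and `predict`. Swapping in a different model usually changes only the import and the constructor line.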

We will use the following data to demonstrate two algorithms, PCA and $k$-means.

In [2]:
import nltk
nltk.download('inaugural')

from nltk.corpus import inaugural

files = inaugural.fileids()
texts = [inaugural.raw(file) for file in files]
years = [file[:-4].split("-")[0] for file in files]
presidents = [file[:-4].split("-")[1] for file in files]
df = pd.DataFrame({
    "year": years,
    "president": presidents,
    "file": files,
    "text": texts
})
df.set_index("year", inplace=True)
df.tail() # print last few files

[nltk_data] Downloading package inaugural to /root/nltk_data...
[nltk_data]   Unzipping corpora/inaugural.zip.


| year | president | file | text |
|---|---|---|---|
| 2005 | Bush | 2005-Bush.txt | Vice President Cheney, Mr. Chief Justice, Pres... |
| 2009 | Obama | 2009-Obama.txt | My fellow citizens:\n\nI stand here today humb... |
| 2013 | Obama | 2013-Obama.txt | Thank you. Thank you so much.\n\nVice Presiden... |
| 2017 | Trump | 2017-Trump.txt | Chief Justice Roberts, President Carter, Presi... |
| 2021 | Biden | 2021-Biden.txt | Chief Justice Roberts, Vice President Harris, ... |


In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')
X = tfidf.fit_transform(texts).toarray()
print(X.shape)

(59, 8984)


### Principal component analysis (PCA)

In our data, each sample is a vector of 8984 entries, which we can view as a point in an 8984-dimensional space.  Such a high-dimensional space is practically impossible to visualize directly, so we transform the data into a lower-dimensional space while losing as little information as possible.  

PCA is a dimensionality reduction algorithm: it projects the data onto the directions (the principal components) along which the data varies the most.
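
One rough way to judge how much information a projection keeps is the `explained_variance_ratio_` attribute of a fitted `PCA`. The sketch below uses synthetic data, not the inaugural speeches: 100 points in 5 dimensions that are built to vary almost entirely within a 2-dimensional subspace. The names `latent`, `mixing`, `X_demo`, and `pca_demo` are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 2 hidden coordinates mixed into 5 observed features, plus small noise
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 5))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(100, 5))

pca_demo = PCA(n_components=2)
X_demo_2d = pca_demo.fit_transform(X_demo)

print(X_demo_2d.shape)                            # 100 samples, 2 components
print(pca_demo.explained_variance_ratio_)         # share of variance per component
print(pca_demo.explained_variance_ratio_.sum())   # close to 1 for this data
```

For real text data the first two components typically capture a much smaller share of the variance, so a 2D scatter plot is a convenient view rather than a faithful summary.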

In [4]:
from sklearn.decomposition import PCA
model = PCA(2)
X_new = model.fit_transform(X)
X_new.shape

(59, 2)

In [5]:
df['x0'] = X_new[:,0]
df['x1'] = X_new[:,1]
df.plot(kind='scatter', x='x0', y='x1',
    color=df.index.astype(int), hover_data=['president'],
    backend='plotly')

### $k$-means clustering

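Given a dataset without labels, one may try to partition the data points into several clusters and use the cluster index as a predicted label for each point.

The $k$-means clustering algorithm assigns each point to the cluster whose center is nearest, choosing the $k$ centers so that points are close to their own center. The number of clusters $k$ is a hyperparameter chosen by the user.

_The code below uses the same data as above and uses the PCA-transformed features to plot the points._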

In [6]:
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3, n_init=10, random_state=0)  # fixed seed so the clusters are reproducible
y = model.fit_predict(X)





In [7]:
df['y'] = y.astype('object')
df.plot(kind='scatter', x='x0', y='x1',
        color='y', hover_data=['president'],
        backend='plotly')
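
To check that the clustering behaves as expected, it can help to run it on data where the answer is known. The sketch below is an illustration on synthetic data, not part of the original analysis: it draws 20 points around each of three made-up centers and inspects the labels and the fitted `cluster_centers_`. The names `centers`, `X_blobs`, and `km_demo` are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Three well-separated groups of 20 points each (made-up centers)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X_blobs = np.vstack([c + 0.3 * rng.normal(size=(20, 2)) for c in centers])

km_demo = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km_demo.fit_predict(X_blobs)

print(labels)                    # one cluster index per point
print(km_demo.cluster_centers_)  # fitted centers, close to the made-up ones
```

The cluster indices themselves are arbitrary: the group around `[0, 0]` may be called 0, 1, or 2. Only the grouping is meaningful.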

### 📚 Further reading

- [_Python Data Science Handbook_](https://jakevdp.github.io/PythonDataScienceHandbook/) by Jake VanderPlas