ML Course, Bogotá, Colombia  (© Josh Bloom; June 2019)

In [None]:
%run ../talktools.py

# Featurization and Pipelining

<img src="imgs/workflow.png">
Source: [V. Singh](https://www.slideshare.net/hortonworks/data-science-workshop)

<img src="imgs/feature.png">
Source: Lightsidelabs

<img src="imgs/feature2.png">

# Featurization Examples

In the real world, we are very rarely presented with a clean feature matrix. Raw data are missing, noisy, ugly and unfiltered, and sometimes we don't even have the data we need to make models and predictions. Converting raw data into data suitable for learning is time consuming and difficult, and it is where much of the domain understanding is required.

When we extract features from raw data (say PDF documents) we often are presented with a variety of data types:
<img src="imgs/feat.png">

# Categorical & Missing Features


Often we are presented with raw data (say from an Excel spreadsheet) that looks like:

| eye color | height | country of origin | gender |
| ------------| ---------| ---------------------| ------- |
|  brown    |  1.85    |  Colombia           |     M    |
|  brown    |  1.25    |  USA                   |            |
|  blonde   |  1.45    |  Mexico               |     F     |
|  red         |  2.01    |  Mexico               |     F     |
|                |             |  Chile                   |     F     |
|  Brown   |  1.02    |  Colombia           |             |  

What do you notice in this dataset? 

Since many ML algorithms require, as we'll see, a full matrix of numerical input features, a lot of preprocessing work is often needed before we can learn.

In [None]:
import numpy as np
import pandas as pd

df = pd.DataFrame({"eye color": ["brown", "brown", "blonde", "red", None, "Brown"],
  "height": [1.85, 1.25, 1.45, 2.01, None, 1.02],
  "country of origin": ["Colombia", "USA", "Mexico", "Mexico", "Chile", "Colombia"],
  "gender": ["M", None, "F", "F","F", None]})
df

Let's first normalize the data so it's all lower case. This will handle the "Brown" and "brown" issue.

In [None]:
df_new = df.copy()
df_new["eye color"] = df_new["eye color"].str.lower()
df_new

Let's next handle the NaN in the height. What should we use here?

In [None]:
# mean of everyone?
np.nanmean(df_new["height"].values)

In [None]:
# mean of just females?
np.nanmean(df_new[df_new["gender"] == 'F']["height"]) 

In [None]:
df_new1 = df_new.copy()
df_new1.at[4, "height"] = np.nanmean(df_new[df_new["gender"] == 'F']["height"]) 
df_new1
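The same group-wise fill can be written without hard-coding the row index. The following is a minimal sketch on a small stand-in frame; `toy`, `group_means`, `per_row`, and `filled` are illustrative names that are not part of the notebook, and falling back to the overall mean for rows with no gender is an assumption, not the only reasonable choice.

```python
import numpy as np
import pandas as pd

# Stand-in for the height and gender columns of the table above
toy = pd.DataFrame({
    "height": [1.85, 1.25, 1.45, 2.01, np.nan, 1.02],
    "gender": ["M", None, "F", "F", "F", None],
})

# Mean height per gender (missing heights are skipped, missing genders are dropped)
group_means = toy.groupby("gender")["height"].mean()

# Look up the group mean for each row; rows with no gender get NaN here
per_row = toy["gender"].map(group_means)

# Fill missing heights with the group mean, then with the overall mean as a fallback
filled = toy["height"].fillna(per_row).fillna(toy["height"].mean())
print(filled)
```

Only the missing height is changed; heights that were already present are left untouched.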

Let's next handle the eye color. What should we use?

In [None]:
df_new1["eye color"].mode()

In [None]:
df_new2 = df_new1.copy()
df_new2.at[4, "eye color"] = df_new1["eye color"].mode().values[0]
df_new2

How should we handle the missing gender entries?

In [None]:
df_new3 = df_new2.fillna("N/A")
df_new3

We're done, right? No. We fixed the dirty, missing data problem but we still don't have a numerical feature matrix.

We could do a mapping such that "Colombia" -> 1, "USA" -> 2, etc., but that would imply an ordering among what are fundamentally unordered categories. Instead we want to do `one-hot encoding`, where every unique value gets its own column. `pandas` has a function called `get_dummies` which does this for us.

In [None]:
# encode the three categorical columns; each new column is prefixed with its source column name
pd.get_dummies(df_new3, columns=['eye color', 'country of origin', 'gender'])

Note: depending on the learning algorithm you use, you may want to do `drop_first=True` in `get_dummies`.
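To see what `drop_first=True` changes, here is a small sketch on a made-up one-column frame (`toy_country` is a hypothetical name). With `drop_first=True` the first category becomes an implicit baseline, which avoids perfectly collinear columns; that matters for some models, for example unregularized linear regression.

```python
import pandas as pd

# Hypothetical single-column frame, used only for illustration
toy_country = pd.DataFrame({"country": ["Colombia", "USA", "Mexico", "Mexico"]})

full = pd.get_dummies(toy_country)                      # one column per category
reduced = pd.get_dummies(toy_country, drop_first=True)  # first category dropped

print(list(full.columns))
print(list(reduced.columns))
```

In the reduced encoding, a row with zeros in every remaining column stands for the dropped category.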

Of course there are helpful tools that exist for us to deal with dirty, missing data.

In [None]:
%run transform

In [None]:
bt = BasicTransformer(return_df=True)
bt.fit_transform(df_new)

## Time series

The [wafer dataset](http://www.timeseriesclassification.com/description.php?Dataset=Wafer) is a set of timeseries (1000 training examples, 6164 test examples), each capturing sensor measurements from one silicon wafer during the manufacture of semiconductors. Each wafer is classified as normal or abnormal. The abnormal wafers are representative of a range of problems commonly encountered during semiconductor manufacturing.

In [None]:
import requests
from io import StringIO
dat_file = requests.get("https://github.com/zygmuntz/time-series-classification/blob/master/data/wafer/Wafer.csv?raw=true")
data = StringIO(dat_file.text)

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
data.seek(0)
df = pd.read_csv(data, header=None)

In [None]:
df.head()

In [None]:
df[152].value_counts()

In [None]:
## save the data as numpy arrays
target = df.values[:,152].astype(int)
time_series = df.values[:,0:152]

In [None]:
normal_inds = np.argwhere(target == 1) ; np.random.shuffle(normal_inds)
abnormal_inds = np.argwhere(target == -1); np.random.shuffle(abnormal_inds)

num_to_plot = 3
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharey=True, figsize=(12,6))

for i in range(num_to_plot):
    ax1.plot(time_series[normal_inds[i][0],:], label=f"#{normal_inds[i][0]}: {target[normal_inds[i][0]]}")
    ax2.plot(time_series[abnormal_inds[i][0],:], label=f"#{abnormal_inds[i][0]}: {target[abnormal_inds[i][0]]}")

ax1.legend()
ax2.legend()
ax1.set_title("Normal") ; ax2.set_title("Abnormal") 
ax1.set_xlabel("time") ; ax2.set_xlabel("time")
ax1.set_ylabel("Value")

What would be good features here?

In [None]:
f1 = np.mean(time_series, axis=1)  # how about the mean?
f1.shape

In [None]:
import seaborn as sns, numpy as np
import warnings
warnings.filterwarnings("ignore")

ax = sns.distplot(f1)

In [None]:
ax = sns.distplot(f1[normal_inds], kde_kws={"label": "normal"})
sns.distplot(f1[abnormal_inds], ax=ax, kde_kws={"label": "abnormal"})

In [None]:
f2 = np.min(time_series, axis=1)  # how about the minimum?
f2.shape

In [None]:
ax = sns.distplot(f2[normal_inds], kde_kws={"label": "normal"})
sns.distplot(f2[abnormal_inds], ax=ax, kde_kws={"label": "abnormal"})
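Individual statistics like these are usually stacked into one feature matrix with a row per series. A minimal sketch follows, using random numbers as a stand-in for `time_series`; `toy_series`, `summary_features`, and `feature_matrix` are hypothetical names, and the choice of four statistics is an assumption made for illustration.

```python
import numpy as np

# Synthetic stand-in for `time_series`: 5 series of 152 samples each
rng = np.random.default_rng(0)
toy_series = rng.normal(size=(5, 152))

def summary_features(ts):
    """Return an (n_series, 4) matrix holding mean, min, max and std of each row."""
    return np.column_stack([
        ts.mean(axis=1),
        ts.min(axis=1),
        ts.max(axis=1),
        ts.std(axis=1),
    ])

feature_matrix = summary_features(toy_series)
print(feature_matrix.shape)
```

Each column can then be inspected per class, as in the histograms above, before deciding which features to keep.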

There are often entire Python packages devoted to helping us build features from certain types of datasets (timeseries, text, images, movies, etc.). In the case of timeseries, a popular package is `tsfresh` (*"It automatically calculates a large number of time series characteristics, the so called features. Further the package contains methods to evaluate the explaining power and importance of such characteristics for regression or classification tasks."*). See the [tsfresh docs](https://tsfresh.readthedocs.io/en/latest/) and the [list of features generated](https://tsfresh.readthedocs.io/en/latest/text/list_of_features.html).

In [None]:
# !pip install tsfresh

In [None]:
dfc = df.copy()
del dfc[152]
d = dfc.stack()
d = d.reset_index()
d = d.rename(columns={"level_0": "id", "level_1": "time", 0: "value"})
y = df[152]
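The reshaping above turns the wide table (one row per series) into a long table with one row per measurement, identified by `id` and `time`. A tiny sketch on a made-up 2×3 frame shows the result; `wide` and `long_form` are illustrative names only.

```python
import pandas as pd

# Two short made-up series, one per row
wide = pd.DataFrame([[0.1, 0.2, 0.3],
                     [1.0, 1.1, 1.2]])

# stack() moves the column labels into the index; reset_index() turns both
# index levels into ordinary columns named level_0 and level_1
long_form = (wide.stack()
                 .reset_index()
                 .rename(columns={"level_0": "id", "level_1": "time", 0: "value"}))
print(long_form)
```

Each of the 2 series contributes 3 rows, so the long table has 6 rows and the columns `id`, `time`, and `value`.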

In [None]:
from tsfresh import extract_features, extract_relevant_features

max_num = 300  # restrict the extraction to the first max_num series

features_filtered_direct = extract_relevant_features(d[d["id"] < max_num], y.iloc[0:max_num],
                                                     column_id='id', column_sort='time', n_jobs=4)
#extracted_features = extract_features(, column_id="id", 
 #                                                           column_sort="time", disable_progressbar=False, n_jobs=3)

In [None]:
feats = features_filtered_direct[features_filtered_direct.columns[0:4]].rename(lambda x: x[0:14], axis='columns')
feats["target"] = y.iloc[0:max_num]
sns.pairplot(feats, hue="target")

# Text Data

Many applications involve parsing and understanding something about natural language, i.e., speech or text data.  Categorization is a classic usage of Natural Language Processing (NLP): what bucket does this text belong to? 

Question: **What are some examples where learning on text has commercial or industrial applications?**

A classic dataset in text processing is the [20,000+ newsgroup documents corpus](http://qwone.com/~jason/20Newsgroups/). These texts are taken from old discussion threads in 20 different [newsgroups](https://en.wikipedia.org/wiki/Usenet_newsgroup):

<pre>
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x	
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey	
sci.crypt
sci.electronics
sci.med
sci.space
misc.forsale	
talk.politics.misc
talk.politics.guns
talk.politics.mideast	
talk.religion.misc
alt.atheism
soc.religion.christian
</pre>
One of the tasks is to assign a document to the correct group, i.e., to classify which group it belongs to. `sklearn` has a download facility for this dataset:

In [None]:
from sklearn.datasets import fetch_20newsgroups
news_train = fetch_20newsgroups(subset='train', categories=['sci.space','rec.autos'], data_home='datatmp/')

In [None]:
news_train.target_names

In [None]:
print(news_train.data[1])

In [None]:
news_train.target_names[news_train.target[1]]

In [None]:
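# look up the label indices by name rather than assuming their order
autos = np.argwhere(news_train.target == news_train.target_names.index('rec.autos'))
sci = np.argwhere(news_train.target == news_train.target_names.index('sci.space'))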

**How do you (as a human) classify text? What do you look for? How might we make these features?**

In [None]:
# total character count?
f1 = np.array([len(x) for x in news_train.data])
f1

In [None]:
ax = sns.distplot(f1[autos], kde_kws={"label": "autos"})
sns.distplot(f1[sci], ax=ax, kde_kws={"label": "sci"})
ax.set_xscale("log")
ax.set_xlabel("number of characters")

In [None]:
# total word count?
f2 = np.array([len(x.split(" ")) for x in news_train.data])
f2

In [None]:
ax = sns.distplot(f2[autos], kde_kws={"label": "autos"})
sns.distplot(f2[sci], ax=ax, kde_kws={"label": "sci"})
ax.set_xscale("log")
ax.set_xlabel("number of words")

In [None]:
# number of questions asked or exclamations?
f3 = np.array([x.count("?") + x.count("!") for x in news_train.data])
f3

In [None]:
ax = sns.distplot(f3[autos], kde_kws={"label": "autos"})
sns.distplot(f3[sci], ax=ax, kde_kws={"label": "sci"})
ax.set_xlabel("number of questions asked")
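The three hand-made features can be gathered into a single feature matrix with one row per document. This is a minimal sketch on two invented sentences; `basic_text_features`, `toy_docs`, and `text_features` are hypothetical names, and it splits on any whitespace with `split()` rather than on single spaces as above.

```python
import numpy as np

def basic_text_features(docs):
    """Return an (n_docs, 3) array: character count, word count, '?' plus '!' count."""
    return np.array([
        [len(d), len(d.split()), d.count("?") + d.count("!")]
        for d in docs
    ])

# Two invented documents, used only for illustration
toy_docs = ["Is the shuttle launching today?", "My car will not start! Help!"]
text_features = basic_text_features(toy_docs)
print(text_features)
```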

We've got three fairly uninformative features now. We should be able to do better. 
Unsurprisingly, what matters most in NLP is the content: the words used, the tone, the meaning from the ordering of those words. The basic components of NLP are:

 * Tokenization - intelligently splitting up words in sentences, paying attention to conjunctions, punctuation, etc.
 * Lemmatization - reducing a word to its base form
 * Entity recognition - finding proper names, places, etc. in documents
 
There are many Python packages that help with NLP, including `nltk`, `textblob`, `gensim`, etc. Here we'll use the fairly modern and battle-tested [`spaCy`](https://spacy.io/).

In [None]:
#!pip install spacy

In [None]:
#!python -m spacy download en

In [None]:
#!python -m spacy download es

In [None]:
import spacy

# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load("en")

# the spanish model is
# nlp = spacy.load("es")

doc = nlp(u"Guido said that 'Python is one of the best languages for doing Data Science.' "
                   "Why he said that should be clear to anyone who knows Python.")

`doc` is now an iterable with each word/item properly tokenized and tagged.  This is done by applying rules specific to each language.  Linguistic annotations are available as `Token` attributes.

In [None]:
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

In [None]:
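from spacy import displacy

# render inline in the notebook; displacy.serve would start a blocking web server instead
displacy.render(doc, style="dep", jupyter=True)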

In [None]:
displacy.render(doc, style = "ent", jupyter = True)

In [None]:
nlp = spacy.load("es")

In [None]:
# https://www.elespectador.com/noticias/ciencia/decenas-de-nuevas-supernovas-ayudaran-medir-la-expansion-del-universo-articulo-863683
# note the trailing spaces: adjacent string literals are joined with no separator
doc = nlp(u'En los últimos años, los investigadores comenzaron a '
                   'informar un nuevo tipo de supernovas de cinco a diez veces '
                   'más brillantes que las supernovas de Tipo "IA". ')

In [None]:
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

In [None]:
# render inline in the notebook; displacy.serve would start a blocking web server instead
displacy.render(doc, style="dep", jupyter=True)

In [None]:
[i for i in doc.sents]

One very powerful way to featurize text/documents is to count the frequency of words; this is called **bag of words**. The number of times each individual token occurs is used as a feature. So the two English sentences from the Guido example above become:

```json
{"Guido": 1,
  "said": 2,
  "that": 2,
  "Python": 2,
  "is": 1,
  "one": 1,
  "of": 1,
  "the": 1,
  "best": 1,
  "languages": 1,
  "for": 1,
  "doing": 1,
  "Data": 1,
  "Science": 1,
  "Why": 1,
  "he": 1,
  "should": 1,
  "be": 1,
  "clear": 1,
  "to": 1,
  "anyone": 1,
  "who": 1,
  "knows": 1
}
```
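The counts for the two English example sentences can be reproduced with the standard library alone. This is a rough sketch that assumes a deliberately naive tokenizer (runs of ASCII letters, case preserved); it is not how spaCy tokenizes, and `text`, `tokens`, and `bow` are illustrative names.

```python
import re
from collections import Counter

text = ("Guido said that 'Python is one of the best languages for doing Data Science.' "
        "Why he said that should be clear to anyone who knows Python.")

# Naive tokenization: every run of ASCII letters is a token
tokens = re.findall(r"[A-Za-z]+", text)

# Bag of words: map each distinct token to the number of times it occurs
bow = Counter(tokens)
print(bow.most_common(3))
```

Word order is discarded entirely: only the token-to-count mapping survives.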


A corpus of documents can be represented as a matrix with one row per document and one column per token.

Question: **What are some challenges you see with brute force BoW?**

In [None]:
from spacy.lang.en.stop_words import STOP_WORDS
STOP_WORDS

`sklearn` has a number of helper functions, including the [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html):

> Convert a collection of text documents to a matrix of token counts. This implementation produces a sparse representation of the counts using `scipy.sparse.csr_matrix`.
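As a quick illustration of the document-by-token matrix, here is a sketch using `CountVectorizer` with its default tokenizer (lowercasing, tokens of two or more word characters) on two invented sentences. `toy_corpus`, `vectorizer`, and `counts` are hypothetical names, and the default tokenizer is an assumption made to keep the example independent of spaCy.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two invented documents, used only for illustration
toy_corpus = ["Python is great", "Python is Python"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(toy_corpus)  # scipy sparse matrix, one row per document

# vocabulary_ maps each token to its column index (columns are in alphabetical order)
print(vectorizer.vocabulary_)
print(counts.toarray())
```

The columns are `great`, `is`, and `python`, so the second row counts `python` twice and `great` zero times.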

In [None]:
# the following is from https://www.dataquest.io/blog/tutorial-text-classification-in-python-using-spacy/
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
import string
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English

# Create our list of punctuation marks
punctuations = string.punctuation

# Load the full English model and create our list of stopwords
nlp = spacy.load('en')
stop_words = STOP_WORDS

# Create a blank English pipeline for tokenizing; unlike spacy.load,
# this does not load the tagger, parser, or NER
parser = English()

# Creating our tokenizer function
def spacy_tokenizer(sentence):
    # Creating our token object, which is used to create documents with linguistic annotations.
    mytokens = parser(sentence)

    # Lemmatizing each token and converting each token into lowercase
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]

    # Removing stop words
    mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]

    # return preprocessed list of tokens
    return mytokens

# Custom transformer using spaCy
class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        # Cleaning Text
        return [clean_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}

# Basic function to clean the text
def clean_text(text):
    # Removing spaces and converting text into lowercase
    return text.strip().lower()

In [None]:
bow_vector = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range=(1,1))

In [None]:
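# `doc` was last set to the Spanish example, so rebuild the English one first
doc = nlp(u"Guido said that 'Python is one of the best languages for doing Data Science.' "
          "Why he said that should be clear to anyone who knows Python.")
X = bow_vector.fit_transform([x.text for x in doc.sents])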

In [None]:
X

In [None]:
bow_vector.get_feature_names()

Why did we get `datum` as one of our feature names?

In [None]:
X.toarray()

In [None]:
doc.text

Let's try a bigger corpus (the newsgroups):

In [None]:
news_train = fetch_20newsgroups(subset='train', 
                                                        remove=('headers', 'footers', 'quotes'),
                                                        categories=['sci.space','rec.autos'], data_home='datatmp/')

In [None]:
%time X = bow_vector.fit_transform(news_train.data)

In [None]:
X

In [None]:
bow_vector.get_feature_names()

Most of those features will only appear once and we might not want to include them (as they add noise). In order to reweight the count features into floating point values suitable for usage by a classifier it is very common to use the *tf–idf* transform. 

From [`sklearn`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer): 

> Tf means term-frequency while tf-idf means term-frequency times inverse document-frequency. This is a common term weighting scheme in information retrieval, that has also found good use in document classification.
The goal of using tf-idf instead of the raw frequencies of occurrence of a token in a given document is to scale down the impact of tokens that occur very frequently in a given corpus and that are hence empirically less informative than features that occur in a small fraction of the training corpus.


Let's keep only those terms that show up in at least 3% of the docs, but not those that show up in more than 90%.

In [None]:
tfidf_vector = TfidfVectorizer(tokenizer = spacy_tokenizer, min_df=0.03, max_df=0.9, max_features=1000)

In [None]:
%time X = tfidf_vector.fit_transform(news_train.data)

In [None]:
tfidf_vector.get_feature_names()

In [None]:
X

In [None]:
print(X[1,:])
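To make the tf-idf weighting concrete, here is a minimal sketch on a made-up three-document corpus, using `TfidfVectorizer` with its default settings rather than the spaCy tokenizer above. `toy_corpus`, `vec`, `weights`, and `idf_of` are hypothetical names. A term that appears in every document ("the") receives the smallest inverse document frequency, and rarer terms receive larger ones.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented corpus: "the" is in all three documents, "rocket" in two, "car" in one
toy_corpus = ["the rocket launched", "the car stalled", "the rocket orbit"]

vec = TfidfVectorizer()
weights = vec.fit_transform(toy_corpus)

def idf_of(term):
    """Inverse document frequency learned for one vocabulary term."""
    return vec.idf_[vec.vocabulary_[term]]

print(idf_of("the"), idf_of("rocket"), idf_of("car"))

# With the default settings, each document row is scaled to unit Euclidean length
row_norms = np.linalg.norm(weights.toarray(), axis=1)
print(row_norms)
```

Within any one document, the weight of "the" is therefore lower than that of the rarer words next to it.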

One of the challenges with BoW and TF-IDF is that we lose context. "Me gusta esta clase, no" ("I like this class, no") is the same as "No me gusta esta clase" ("I don't like this class"). 

One way to handle this is with N-grams: not just frequencies of individual words but of groupings of n words. E.g., "me gusta", "gusta esta", "esta clase", "clase no", "no me" (bigrams). 

In [None]:
bow_vector = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range=(1,2))
X = bow_vector.fit_transform([x.text for x in doc.sents])
bow_vector.get_feature_names()
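The two Spanish sentences make the point directly. In this sketch, which assumes `CountVectorizer`'s default tokenizer instead of the spaCy one (the variable names are illustrative), the unigram vectors of the two sentences are identical while the bigram vectors differ.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["Me gusta esta clase, no", "No me gusta esta clase"]

# Unigram bag of words: word order is ignored
unigram = CountVectorizer(ngram_range=(1, 1)).fit_transform(sentences).toarray()

# Bigrams only: pairs of adjacent words keep some of the ordering
bigram_vec = CountVectorizer(ngram_range=(2, 2))
bigram = bigram_vec.fit_transform(sentences).toarray()

same_unigrams = bool((unigram[0] == unigram[1]).all())
same_bigrams = bool((bigram[0] == bigram[1]).all())

print(same_unigrams, same_bigrams)
print(sorted(bigram_vec.vocabulary_))
```

Only the first sentence contains the bigram "clase no" and only the second contains "no me", which is what separates the two vectors.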

As we'll see later in the week, while bigram TF-IDF certainly works to capture some small-scale meaning, word embeddings tend to do very well.