# Text feature extraction

In this section, we will learn how to extract text features from a real-world dataset, the 20 newsgroups text dataset.

This tutorial combines the following scikit-learn guides:

https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

https://scikit-learn.org/stable/datasets/real_world.html#the-20-newsgroups-text-dataset

In [1]:
### Load essential packages
from sklearn.datasets import fetch_20newsgroups
from pprint import pprint
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer,TfidfTransformer
import numpy as np
from sklearn.linear_model import LinearRegression
### Load the dataset; this might take a while the first time because scikit-learn downloads it
newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')

In [2]:
### All possible targets
pprint(newsgroups_train.target_names)

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']


The real data lies in the filenames and target attributes. The target attribute holds, for each document, the integer index of its category in target_names.

In [3]:
print(newsgroups_train.filenames.shape)
print(newsgroups_train.target.shape)
print(newsgroups_train.target[:20])

(11314,)
(11314,)
[ 7  4  4  1 14 16 13  3  2  4  8 19  4 14  6  0  1  7 12  5]
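To make the index-to-name mapping concrete, here is a minimal sketch that decodes the first three target values printed above. It copies the category list and the target values from the outputs shown earlier instead of loading the dataset; `first_targets` and `first_labels` are names introduced only for this illustration.

```python
# Category names, copied from the target_names output shown earlier.
target_names = [
    'alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc',
    'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x',
    'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball',
    'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med',
    'sci.space', 'soc.religion.christian', 'talk.politics.guns',
    'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc',
]

# First three target values, copied from the printed target array.
first_targets = [7, 4, 4]

# Each target is simply a position in target_names.
first_labels = [target_names[i] for i in first_targets]
print(first_labels)
# ['rec.autos', 'comp.sys.mac.hardware', 'comp.sys.mac.hardware']
```

With the loaded dataset, the same lookup for document `i` would be `newsgroups_train.target_names[newsgroups_train.target[i]]`.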


Here is one example document:

In [4]:
print(newsgroups_train.data[2])

From: twillis@ec.ecn.purdue.edu (Thomas E Willis)
Subject: PB questions...
Organization: Purdue University Engineering Computer Network
Distribution: usa
Lines: 36

well folks, my mac plus finally gave up the ghost this weekend after
starting life as a 512k way back in 1985.  sooo, i'm in the market for a
new machine a bit sooner than i intended to be...

i'm looking into picking up a powerbook 160 or maybe 180 and have a bunch
of questions that (hopefully) somebody can answer:

* does anybody know any dirt on when the next round of powerbook
introductions are expected?  i'd heard the 185c was supposed to make an
appearence "this summer" but haven't heard anymore on it - and since i
don't have access to macleak, i was wondering if anybody out there had
more info...

* has anybody heard rumors about price drops to the powerbook line like the
ones the duo's just went through recently?

* what's the impression of the display on the 180?  i could probably swing
a 180 if i got the 80Mb disk

It is possible to load only a sub-selection of the categories by passing the list of the categories to load to the *sklearn.datasets.fetch_20newsgroups* function:

In [5]:
subset=['comp.graphics','sci.electronics']
newsgroups_train = fetch_20newsgroups(subset='train', categories=subset)
newsgroups_test = fetch_20newsgroups(subset='test', categories=subset)
pprint(newsgroups_train.target_names)
print(newsgroups_train.filenames.shape)
print(newsgroups_train.target.shape)
print(newsgroups_train.target[:20])

['comp.graphics', 'sci.electronics']
(1175,)
(1175,)
[1 0 1 1 0 0 0 0 1 0 1 0 1 1 1 0 0 1 1 0]


### Using scikit-learn to extract Bag-of-Words vectors

Text analysis is a major application field for machine learning algorithms. However, the raw data, a sequence of symbols, cannot be fed directly to the algorithms themselves, as most of them expect numerical feature vectors with a fixed size rather than raw text documents with variable length.

In order to address this, scikit-learn provides utilities for the most common ways to extract numerical features from text content, namely:

- **tokenizing** strings and giving an integer id for each possible token, for instance by using white-spaces and punctuation as token separators.

- **counting** the occurrences of tokens in each document.

- **normalizing** and weighting the counts, giving diminishing importance to tokens that occur in the majority of samples / documents.

In this scheme, features and samples are defined as follows:

- each **individual token occurrence frequency** (normalized or not) is treated as a **feature**.

- the vector of all the token frequencies for a given **document** is considered a multivariate **sample**.

We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or “Bag of n-grams” representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
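The tokenize-and-count steps can be illustrated with a small pure-Python sketch. This is not scikit-learn's implementation, only a toy version under stated assumptions: the two-sentence corpus is made up, the regular expression is meant to mirror the default token pattern of `CountVectorizer` (lowercased tokens of two or more word characters), and `tokenize` and `count_vector` are hypothetical helper names.

```python
import re
from collections import Counter

# Toy corpus (made up for illustration, not from the 20 newsgroups data).
docs = [
    "The cat sat on the mat.",
    "The dog chased the cat!",
]

# Assumption: tokens are runs of two or more word characters, lowercased.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    """Split a document into lowercase tokens."""
    return TOKEN_PATTERN.findall(text.lower())

# Tokenizing: give every distinct token an integer id (alphabetical order).
all_tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
vocabulary = {tok: idx for idx, tok in enumerate(all_tokens)}

def count_vector(text):
    """Counting: one fixed-size vector of token counts per document."""
    counts = Counter(tokenize(text))
    return [counts.get(tok, 0) for tok in all_tokens]

matrix = [count_vector(doc) for doc in docs]

print(all_tokens)
# ['cat', 'chased', 'dog', 'mat', 'on', 'sat', 'the']
print(matrix)
# [[1, 0, 0, 1, 1, 1, 2], [1, 1, 1, 0, 0, 0, 2]]

# Word order is ignored: a reshuffled sentence yields the same vector.
print(count_vector("The mat sat on the cat") == matrix[0])
# True
```

Every document maps to a vector of the same length (the vocabulary size), regardless of how long the document is. scikit-learn stores the result as a sparse matrix instead of nested lists, since most entries are zero.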

In [6]:
### Declare the vectorizer for the Bag-of-Words representation
vectorizer = CountVectorizer()
vector_train = vectorizer.fit_transform(newsgroups_train.data)

print(vector_train.shape)

(1175, 20915)


In [7]:
### get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
print(vectorizer.get_feature_names_out())



In [8]:
### Fit a linear regression model on the Bag-of-Words vectors
LR_model = LinearRegression()
LR_model.fit(vector_train,newsgroups_train.target)

LinearRegression()

In [9]:
### First, transform the test documents to vectors using the same vectorizer
vector_test = vectorizer.transform(newsgroups_test.data)

### Then, predict on the test vectors using the trained model
pred=LR_model.predict(vector_test)

### Simple thresholding: map the real-valued outputs to class labels 0 and 1
pred[pred<0.5]=0
pred[pred>=0.5]=1

### Compute the F1 score of the predictions (true labels first, then predictions)
from sklearn import metrics
print(metrics.f1_score(newsgroups_test.target, pred))

0.829329962073325
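For readers unfamiliar with the metric, the thresholding and F1 computation can be sketched in plain Python. The scores and labels are made-up numbers, and `threshold` and `f1_score_binary` are hypothetical helpers written for this illustration, not scikit-learn functions. The sketch uses the standard binary definition F1 = 2·TP / (2·TP + FP + FN), with class 1 as the positive class.

```python
def threshold(scores, cutoff=0.5):
    """Map real-valued regression outputs to class labels 0 or 1."""
    return [1 if s >= cutoff else 0 for s in scores]

def f1_score_binary(y_true, y_pred):
    """Binary F1 with class 1 as positive: 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Made-up regression outputs and true labels.
scores = [0.9, 0.2, 0.6, 0.4, 1.3, -0.1]
y_true = [1, 0, 0, 1, 1, 0]

y_pred = threshold(scores)
print(y_pred)
# [1, 0, 1, 0, 1, 0]

f1 = f1_score_binary(y_true, y_pred)
print(round(f1, 4))
# 0.6667
```

Here there are two true positives, one false positive and one false negative, so F1 = 4 / 6. Linear regression outputs are unbounded (note the 1.3 and -0.1), which is why a cutoff is needed before the predictions can be scored as class labels.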


### Using scikit-learn to extract TF-IDF vectors
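
TF-IDF reweights the raw counts so that tokens appearing in many documents contribute less than tokens that are specific to a few documents. The sketch below works through the arithmetic on a made-up count matrix. It assumes the smoothed formula documented for scikit-learn's default settings, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row; `tfidf_row` is a hypothetical helper, and the code illustrates the weighting scheme, not the library's implementation.

```python
import math

# Made-up term-count matrix: 3 documents (rows) x 3 terms (columns).
counts = [
    [3, 0, 1],
    [2, 0, 0],
    [3, 1, 0],
]

n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents does each term occur?
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency (assumed default formula).
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def tfidf_row(row):
    """Weight the counts by idf, then scale the row to unit L2 norm."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm if norm else 0.0 for v in weighted]

tfidf = [tfidf_row(row) for row in counts]

print(df)
# [3, 1, 1]
print([round(w, 4) for w in idf])
# [1.0, 1.6931, 1.6931]
```

The first term occurs in every document and gets the minimum weight of 1.0, while the two rarer terms are weighted up. Only the values change, not the number of columns, which is why the TF-IDF matrix has the same shape as the Bag-of-Words matrix.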


In [10]:
vectorizer = TfidfVectorizer()
vector_train = vectorizer.fit_transform(newsgroups_train.data)
print(vector_train.shape)

(1175, 20915)


In [11]:
### Fit a linear regression model using vectors extracted from TF-IDF
LR_model = LinearRegression()
LR_model.fit(vector_train,newsgroups_train.target)

LinearRegression()

In [12]:
### First, transform the test documents to vectors using the same vectorizer
vector_test = vectorizer.transform(newsgroups_test.data)

### Then, predict on the test vectors using the trained model
pred=LR_model.predict(vector_test)

### Simple thresholding: map the real-valued outputs to class labels 0 and 1
pred[pred<0.5]=0
pred[pred>=0.5]=1

### Compute the F1 score of the predictions (true labels first, then predictions)
from sklearn import metrics
print(metrics.f1_score(newsgroups_test.target, pred))

0.9147869674185464


### scikit-learn guide

##### Supervised learning models and examples
https://scikit-learn.org/stable/supervised_learning.html

##### Unsupervised learning models and examples
https://scikit-learn.org/stable/unsupervised_learning.html

##### Evaluation metrics
https://scikit-learn.org/stable/modules/model_evaluation.html