# Tutorial: GFLM for implicit feature detection

## Description
This tutorial walks you through the feature_mining package.
It has two parts:
* a quick start guide that uses the default workflow; this should be enough to get you started
* a more detailed step-by-step guide, for when you want to fine-tune some of the parameters

## Goal
#### Given the following:
* a text dataset
* a set of predefined features

#### Compute the following:
* a mapping of explicit and implicit features onto the data
* the mapping is computed with both the gflm_word and gflm_section algorithms

## Naming conventions
We will use the following naming conventions:
* section = one line of the reviews data; roughly equivalent to one sentence
* pm = ParseAndModel module
* em = Expectation Maximization module
* gflm = Generative Feature Language Model

# 1. Quick start guide
Use this section to start detecting implicit features in a text right away.

**Prerequisite:**
If you have not installed the feature_mining package yet, install it with the following command:
```
pip install feature_mining
```

## 1.1 Import the module and instantiate a FeatureMining object

In [1]:
import feature_mining
fm = feature_mining.FeatureMining()

## 1.2 Load the demo files
The package comes with a demo data set, based on iPod reviews.
It also ships with a **default set of features**, which will be mapped onto each section of the reviews.

In [3]:
# Load default dataset with default feature set
fm.load_ipod()

loaded features:  ['sound', 'battery', ['screen', 'display'], 'storage', 'size', 'headphones', 'software', 'price', 'button']
loaded dataset: ipod


In [5]:
print(fm.pm.model_results.keys())

dict_keys(['model_background', 'model_feature', 'section_word_counts_matrix', 'model_background_matrix', 'model_feature_matrix', 'vocabulary_lookup'])


## 1.3 Execute Expectation-Maximization on iPod dataset

In [7]:
# Run Expectation-Maximization on the previously loaded data.
fm.fit()

EmVectorByFeature - base init...
EmVectorByFeature - base loop...
Maximum iterations reached
Elapsed: 0.4578 seconds


## 1.4 Compute feature mapping

In [8]:
fm.predict()

## 1.5 Inspect the results
* **gflm_word** and **gflm_section** are the values computed by gflm
* **section_id** is the section to which the value refers (the sentence)
* **implicit_feature_id** is the feature detected in the section

In [9]:
print(fm.gflm.gflm_word.head(10))
print(fm.gflm.gflm_section.tail(10))

   gflm_word  section_id  implicit_feature_id
0   0.820335          12                    0
1   0.650862          22                    0
2   0.650862          26                    0
3   0.404019          58                    0
4   0.444443          68                    0
5   0.529449         123                    0
6   0.453608         134                    0
7   0.424348         136                    0
8   0.673490         154                    0
9   0.820335         171                    0
     gflm_section  section_id  implicit_feature_id
307      0.917218         269                    8
308      0.755447         270                    8
309      0.505446         271                    8
310      0.513713         275                    8
311      0.845943         278                    8
312      0.657352         281                    8
313      0.432511         286                    8
314      0.566332         290                    8
315      0.791786         298      
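
The `implicit_feature_id` column is easier to read once it is translated back into a feature name. The sketch below assumes that the id is the zero-based position of the feature in the list printed by `fm.load_ipod()`; this is consistent with the ids 0 to 8 shown above for nine features, but it is not confirmed by the package documentation. The helper `feature_label` is hypothetical, and the rows are copied from the two outputs above.

```python
# Feature list as printed by fm.load_ipod() in section 1.2.
features = ['sound', 'battery', ['screen', 'display'], 'storage', 'size',
            'headphones', 'software', 'price', 'button']


def feature_label(feature):
    """Hypothetical helper: join grouped synonyms into a single label."""
    return "/".join(feature) if isinstance(feature, list) else feature


# ASSUMPTION: implicit_feature_id is the zero-based index into `features`.
labels = [feature_label(f) for f in features]

# (score, section_id, implicit_feature_id) rows copied from the output above.
rows = [(0.820335, 12, 0), (0.650862, 22, 0), (0.917218, 269, 8)]

for score, section_id, feature_id in rows:
    print(f"section {section_id}: {labels[feature_id]} ({score:.3f})")
```

On the real results, the same lookup can be applied to the DataFrame column, for example with `fm.gflm.gflm_word["implicit_feature_id"].map(lambda i: labels[i])`.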

# 2. Detailed guide
Use this procedure if you want to know more about the internal workings of the project, or if you wish to fine-tune some of its parameters.

## 2.1 Import modules

In [10]:
"""
Import the feature_mining module and the three classes used in this guide:
ParseAndModel, EmVectorByFeature and GFLM.
"""
import feature_mining
from feature_mining import ParseAndModel
from feature_mining import EmVectorByFeature
from feature_mining import GFLM
import pandas as pd
import en_core_web_sm
from pprint import pprint

## 2.2 Load the demo files

In [22]:
# Create a model based on a predefined list of features and an input data file.
import pkg_resources
filename = pkg_resources.resource_filename('feature_mining', 'data/iPod.final')
feature_list = ["sound", "battery", ["screen", "display"]]

pm = ParseAndModel(feature_list=feature_list,   # list of features
                   filename=filename,           # file with input data
                   nlines=100)                  # number of lines to read

print(pm.model_results.keys())

dict_keys(['model_background', 'model_feature', 'section_word_counts_matrix', 'model_background_matrix', 'model_feature_matrix', 'vocabulary_lookup'])
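
The feature list mixes plain strings with a nested list (`["screen", "display"]`), which reads as a group of synonyms for one feature. The following is a minimal sketch of how such a list could be used to tag sections that mention a feature explicitly. It is not the code used by `ParseAndModel`; the function names `build_feature_lookup` and `explicit_features` are hypothetical, and the whitespace tokenization is a deliberate simplification.

```python
def build_feature_lookup(feature_list):
    """Map every feature word (including synonyms) to its feature id."""
    lookup = {}
    for feature_id, feature in enumerate(feature_list):
        # A nested list is treated as a group of synonyms for one feature.
        words = feature if isinstance(feature, list) else [feature]
        for word in words:
            lookup[word.lower()] = feature_id
    return lookup


def explicit_features(section, lookup):
    """Return the sorted ids of features whose words appear in the section."""
    tokens = section.lower().split()  # naive tokenization, no punctuation handling
    return sorted({lookup[token] for token in tokens if token in lookup})


feature_list = ["sound", "battery", ["screen", "display"]]
lookup = build_feature_lookup(feature_list)

# Illustrative sentence, not taken from the iPod dataset.
print(explicit_features("the display is bright and the sound is clear", lookup))
```

Sections for which this returns an empty list are the ones where only implicit detection can help.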


## 2.3 Inspect the model

In [23]:
# Keys in the model dictionary
print(pm.model_results.keys())

# Language background model
print("Model background")
pprint(pm.model_results['model_background'][0:7])

# Feature model
print("Feature model")
pprint(pm.model_results['model_feature'][0][0:2])

# Word counts per section matrix (sentence/line)
print("Section word counts matrix (sentence/line) - sparse")
pprint(pm.model_results['section_word_counts_matrix'][0][0:2])

# Background model matrix - sparse
print("Background model matrix - sparse")
pprint(pm.model_results['model_background_matrix'][0][0:2])

# Feature model matrix
print("Feature model matrix")
pprint(pm.model_results['model_feature_matrix'][0:2])

# Vocabulary words
print("Vocabulary words")
pprint(pm.model_results['vocabulary_lookup'][0])

dict_keys(['model_background', 'model_feature', 'section_word_counts_matrix', 'model_background_matrix', 'model_feature_matrix', 'vocabulary_lookup'])
Model background
[0.004310344827586207,
 0.004310344827586207,
 0.01293103448275862,
 0.05603448275862069,
 0.023706896551724137,
 0.0021551724137931034,
 0.017241379310344827]
Feature model
[0.0035684588810039313, 0.0035684588810039313]
Section word counts matrix (sentence/line) - sparse
<1x258 sparse matrix of type '<class 'numpy.float64'>'
	with 1 stored elements in Compressed Sparse Row format>
Background model matrix - sparse
<1x258 sparse matrix of type '<class 'numpy.float64'>'
	with 258 stored elements in Compressed Sparse Row format>
Feature model matrix
array([[0.00356846, 0.00336429, 0.00363519],
       [0.00356846, 0.00336429, 0.00363519]])
Vocabulary words
'pleased'
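
The background values printed above are consistent with relative word frequencies (for example, 0.0021551724137931034 equals 1/464). As a rough illustration of that idea, here is a sketch of a unigram background model built from word counts. It assumes whitespace tokenization and no smoothing; `background_model` is a hypothetical name, and the actual preprocessing in `ParseAndModel` may differ.

```python
from collections import Counter


def background_model(sections):
    """Estimate P(word) as the word's count divided by the total word count."""
    counts = Counter(
        word for section in sections for word in section.lower().split()
    )
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Two illustrative sections, not taken from the iPod dataset.
sections = ["the sound is great", "the battery is weak"]
model = background_model(sections)

for word, probability in sorted(model.items()):
    print(f"{word}: {probability:.3f}")
```

Because every probability is a count divided by the same total, the values sum to one over the vocabulary.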


## 2.4 Launch Expectation Maximization on the features

In [24]:
print("Calling EMVectorByFeature")
em = EmVectorByFeature(explicit_model=pm,
                       max_iter=30)
em.em()

Calling EMVectorByFeature
EmVectorByFeature - base init...
EmVectorByFeature - base loop...
Maximum iterations reached
Elapsed: 0.1029 seconds
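
The message "Maximum iterations reached" indicates that the loop stopped at `max_iter` rather than at a convergence test. To show how such a loop is typically structured, the sketch below runs a deliberately simplified EM: it estimates only the mixing weights of fixed word distributions for one section. It is not the algorithm implemented in `EmVectorByFeature`; the function name, the `tol` parameter and the toy distributions are all assumptions made for illustration.

```python
def em_mixing_weights(word_counts, topic_models, max_iter=30, tol=1e-6):
    """Estimate mixing weights over fixed word distributions with EM.

    Returns the weights and the number of iterations that were run.
    """
    k = len(topic_models)
    weights = [1.0 / k] * k  # start from a uniform mixture
    for iteration in range(1, max_iter + 1):
        expected = [0.0] * k
        for word, count in word_counts.items():
            # E-step: responsibility of each component for this word.
            joint = [weights[j] * topic_models[j].get(word, 0.0) for j in range(k)]
            norm = sum(joint)
            if norm == 0.0:
                continue  # word is not covered by any component
            for j in range(k):
                expected[j] += count * joint[j] / norm
        scale = sum(expected)
        if scale == 0.0:
            return weights, iteration
        # M-step: new weights are the normalized expected counts.
        new_weights = [value / scale for value in expected]
        change = max(abs(a - b) for a, b in zip(weights, new_weights))
        weights = new_weights
        if change < tol:
            return weights, iteration  # converged before max_iter
    return weights, max_iter


# Toy word distributions for two hypothetical features.
topic_models = [
    {"sound": 0.8, "battery": 0.2},
    {"sound": 0.1, "battery": 0.9},
]
weights, iterations = em_mixing_weights({"battery": 3, "sound": 1}, topic_models)
print(weights, iterations)
```

With a small `max_iter`, the loop may return before the change drops below `tol`, which is the situation the log message above reports.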


## 2.5 Compute GFLM

In [25]:
gflm = GFLM(em_results=em, section_threshold=0.35, word_threshold=0.35)
gflm.calc_gflm_section()
gflm.calc_gflm_word()

print(gflm.gflm_word.head(20))
print(gflm.gflm_section.head(20))

    gflm_word  section_id  implicit_feature_id
0    0.748721          12                    0
1    0.653669          22                    0
2    0.569295          26                    0
3    0.351849          57                    0
4    0.375870          81                    0
5    0.386211          89                    0
6    0.553966          11                    1
7    0.737475          13                    1
8    0.737475          14                    1
9    0.433130          18                    1
10   0.434692          19                    1
11   0.400792          37                    1
12   0.400459          43                    1
13   0.554363          44                    1
14   0.457476          50                    1
15   0.737475          88                    1
16   0.394000           1                    2
17   0.354486           3                    2
18   0.378213           7                    2
19   0.394486          10                    2
    gflm_sect
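
Every score in the output above is at least 0.35, which suggests that `word_threshold` and `section_threshold` act as lower cut-offs on the reported values. The sketch below illustrates that reading on plain Python tuples. It is an assumption about the behaviour, not the package's code: `apply_threshold` is a hypothetical helper, the comparison is assumed to be inclusive, and the two low-scoring rows are made up to show the filtering.

```python
def apply_threshold(rows, threshold):
    """Keep rows whose score reaches the threshold, ordered by feature then section."""
    kept = [row for row in rows if row[0] >= threshold]  # ASSUMPTION: inclusive cut-off
    return sorted(kept, key=lambda row: (row[2], row[1]))


# (score, section_id, implicit_feature_id)
rows = [
    (0.737475, 13, 1),   # from the output above
    (0.748721, 12, 0),   # from the output above
    (0.351849, 57, 0),   # from the output above
    (0.120000, 40, 0),   # hypothetical low score, expected to be dropped
    (0.300000, 15, 1),   # hypothetical low score, expected to be dropped
]

for score, section_id, feature_id in apply_threshold(rows, threshold=0.35):
    print(f"{score:.6f}  section={section_id}  feature={feature_id}")
```

Raising the threshold trades recall for precision: fewer sections are tagged, but the remaining tags carry higher scores.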