# Tutorial: GFLM for implicit feature detection

## Description
This tutorial walks you through the feature_mining package.
It has two parts:
* a quick start guide that uses the default workflow; this should be enough to get you started
* a more detailed step-by-step guide, for when you want to fine-tune some of the parameters

## Goal
#### Given the following:
* a text dataset
* a set of predefined features

#### Compute the following:
* mapping of explicit and implicit features on the data
* using both gflm_word and gflm_section algorithms
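
To make the goal concrete, here is a minimal sketch of the explicit half of the problem: spotting features that are named outright in a section. The `find_explicit_features` helper and the sample sections are invented for illustration and are not part of feature_mining. The third section names no feature word at all, which is the kind of case where an implicit feature has to be inferred.

```python
import re

def find_explicit_features(section, features):
    """Return the features whose name appears as a whole word in the section."""
    words = set(re.findall(r"[a-z']+", section.lower()))
    return [f for f in features if f in words]

# Hypothetical feature set and review sections, for illustration only.
features = ["sound", "battery", "screen"]
sections = [
    "The sound is crisp and clear.",
    "Battery died after two hours; the screen scratches easily.",
    "I have to recharge it every single night.",  # hints at "battery" without naming it
]

explicit = {s: find_explicit_features(s, features) for s in sections}
for section, found in explicit.items():
    print(found, "<-", section)
```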

## Naming conventions
We will use the following naming conventions:
* section = one line of the reviews data; roughly equivalent to one sentence
* pm = ParseAndModel module
* em = Expectation Maximization module
* gflm = Generative Feature Language Model

# 1. Quick start guide
Use this section to jump straight into detecting implicit features in a text.

**Prerequisite:**
If you have not installed the feature_mining module yet, you can install it with the following command:
```
pip install feature_mining
```

## 1.1 Import the module and instantiate a FeatureMining object

In [None]:
import feature_mining
fm = feature_mining.FeatureMining()

## 1.2 Load the demo files
The package comes with a demo data set, based on iPod reviews.
We have already initialized a **default set of features** which will be mapped on each section of the review.

In [None]:
# Load default dataset with default feature set
fm.load_ipod(full_set=False)

## 1.3 Execute Expectation-Maximization on iPod dataset

In [None]:
# Run Expectation-Maximization on the previously loaded data.
fm.fit()

## 1.4 Compute feature mapping

In [None]:
fm.predict()

## 1.5 Inspect the results
* **gflm_word** and **gflm_section** are the values computed by gflm
* **section_id** is the section to which the value refers (the sentence)
* **implicit_feature_id** is the feature detected in the section

In [None]:
print(fm.gflm.gflm_word.head(10))
print(fm.gflm.gflm_section.tail(10))

## 1.6 Putting it together
Let's now see how these features map back to their original sections (the sentences of the reviews).

**Remark:** this demo uses a subset of the review sentences; to try the full dataset, pass **full_set=True** to load_ipod().

In [None]:
fm.section_features()
fm.gflm_section_result.sort_values(by=['gflm_section'], ascending=False)[['feature', 'section_text']].head(50)
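
The cell above joins each score with its feature name and section text, then ranks the rows by score. As a rough illustration of that join-and-sort idea, here is a plain-Python sketch. The column names follow the ones used in this tutorial, but the rows, scores, and lookup tables are invented and do not come from the iPod dataset.

```python
# Hypothetical lookup tables: section_id -> text, feature_id -> name.
sections = {
    0: "The sound is crisp.",
    1: "I recharge it every night.",
    2: "Colors look washed out.",
}
features = {0: "sound", 1: "battery", 2: "screen"}

# Hypothetical scores in the shape of the gflm_section results.
rows = [
    {"section_id": 1, "implicit_feature_id": 1, "gflm_section": 0.62},
    {"section_id": 0, "implicit_feature_id": 0, "gflm_section": 0.91},
    {"section_id": 2, "implicit_feature_id": 2, "gflm_section": 0.47},
]

# Attach the feature name and section text to each row.
joined = [
    {
        "feature": features[r["implicit_feature_id"]],
        "section_text": sections[r["section_id"]],
        "gflm_section": r["gflm_section"],
    }
    for r in rows
]

# Highest score first, like sort_values(..., ascending=False).
ranked = sorted(joined, key=lambda r: r["gflm_section"], reverse=True)
for r in ranked:
    print(r["feature"], "|", r["section_text"])
```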

# 2. Detailed guide
Use this procedure if you want to know more about the internal workings of the project, or if you wish to fine-tune some of its parameters.

## 2.1 Import modules

In [None]:
"""
Import the feature_mining module and the classes used in this guide:
ParseAndModel, EmVectorByFeature, and GFLM.
"""
import feature_mining
from feature_mining import ParseAndModel
from feature_mining import EmVectorByFeature
from feature_mining import GFLM
import pandas as pd
import en_core_web_sm
from pprint import pprint

## 2.2 Load the demo files

In [None]:
# Create a model based on a predefined list of features and an input data file.
import pkg_resources
filename = pkg_resources.resource_filename('feature_mining', 'data/iPod.final')
feature_list = ["sound", "battery", ["screen", "display"]]

pm = ParseAndModel(feature_list=feature_list,   # list of features
                   filename=filename,           # file with input data
                   nlines=100)                  # number of lines to read

print(pm.model_results.keys())
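
The feature list above mixes plain strings with a nested list. A plausible reading, and it is only an assumption here, is that the nested entry groups several words ("screen", "display") under a single feature. The sketch below shows one way such a list could be flattened into a word-to-feature-index lookup; `build_feature_lookup` is a hypothetical helper, not a function of the package.

```python
def build_feature_lookup(feature_list):
    """Map every feature word to the index of the feature it belongs to."""
    lookup = {}
    for feature_id, entry in enumerate(feature_list):
        # A bare string is a one-word feature; a list holds several words for one feature.
        words = [entry] if isinstance(entry, str) else entry
        for word in words:
            lookup[word] = feature_id
    return lookup

feature_list = ["sound", "battery", ["screen", "display"]]
lookup = build_feature_lookup(feature_list)
print(lookup)
```

Under this reading, "screen" and "display" share index 2, so a mention of either word counts toward the same feature.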

## 2.3 Inspect the model

In [None]:
# Keys in the model dictionary
print(pm.model_results.keys())

# Language background model
print("Model background")
pprint(pm.model_results['model_background'][0:7])

# Feature model
print("Feature model")
pprint(pm.model_results['model_feature'][0][0:2])

# Word counts per section matrix (sentence/line)
print("Section word counts matrix (sentence/line) - sparse")
pprint(pm.model_results['section_word_counts_matrix'][0][0:2])

# Background model matrix - sparse
print("Background model matrix - sparse")
pprint(pm.model_results['model_background_matrix'][0][0:2])

# Feature model matrix
print("Feature model matrix")
pprint(pm.model_results['model_feature_matrix'][0:2])

# Vocabulary words
print("Vocabulary words")
pprint(pm.model_results['vocabulary_lookup'][0])
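
For intuition about what a background language model holds, here is a small sketch that estimates one as relative word frequencies over all sections. This is a common textbook definition and an assumption on my part; the exact structure and smoothing used in `model_background` may differ. The sample sections are invented.

```python
from collections import Counter

def background_model(sections):
    """Estimate p(word) as the word's share of all tokens in the collection."""
    counts = Counter(word for s in sections for word in s.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Hypothetical sections, for illustration only.
sections = ["the sound is great", "the battery is weak", "great screen"]
model = background_model(sections)

# Show the most probable words first.
for word, prob in sorted(model.items(), key=lambda kv: -kv[1])[:3]:
    print(word, round(prob, 2))
```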

## 2.4 Launch Expectation Maximization on the features

In [None]:
print("Calling EmVectorByFeature")
em = EmVectorByFeature(explicit_model=pm,
                       max_iter=30)
em.em()
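
To give a feel for what an EM step does in this kind of model, the sketch below estimates, for a single section, how much each feature's word distribution contributes to the section's words, with a background distribution absorbing common words. It is a simplified illustration and not the package's `EmVectorByFeature` code: the fixed background weight `lam`, the single-section scope, and the toy distributions are all assumptions.

```python
def em_section(counts, feature_models, background, lam=0.5, max_iter=30):
    """Estimate the feature mixture weights for one section with EM."""
    n_features = len(feature_models)
    pi = [1.0 / n_features] * n_features  # start from a uniform mixture
    for _ in range(max_iter):
        expected = [0.0] * n_features
        for word, count in counts.items():
            # Probability of the word under each feature, weighted by the current mixture.
            weighted = [pi[f] * feature_models[f].get(word, 0.0) for f in range(n_features)]
            mix = sum(weighted)
            if mix == 0.0:
                continue  # no feature model can explain this word
            # E-step: chance the word came from the feature mixture rather than the background.
            p_feature = (1 - lam) * mix / (lam * background.get(word, 0.0) + (1 - lam) * mix)
            for f in range(n_features):
                expected[f] += count * p_feature * weighted[f] / mix
        total = sum(expected)
        if total == 0.0:
            break
        # M-step: renormalize the expected counts into new mixture weights.
        pi = [e / total for e in expected]
    return pi

# Toy word distributions (hypothetical): feature 0 = sound, feature 1 = battery.
feature_models = [
    {"sound": 0.5, "loud": 0.3, "the": 0.1, "is": 0.1},
    {"battery": 0.5, "charge": 0.3, "the": 0.1, "is": 0.1},
]
background = {"the": 0.4, "is": 0.4, "sound": 0.05, "loud": 0.05,
              "battery": 0.05, "charge": 0.05}

# Word counts for a section that never says "battery".
counts = {"the": 1, "charge": 1, "is": 1}
pi = em_section(counts, feature_models, background)
print([round(p, 3) for p in pi])
```

The section never names the battery, yet the word "charge" pulls almost all of the weight toward the battery feature, which is the intuition behind implicit feature detection.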

## 2.5 Compute GFLM

In [None]:
gflm = GFLM(em_results=em, section_threshold=0.35, word_threshold=0.35)
gflm.calc_gflm_section()
gflm.calc_gflm_word()

print(gflm.gflm_word.head(20))
print(gflm.gflm_section.head(20))
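
The `section_threshold` and `word_threshold` arguments decide how strong a score must be before a feature is reported. The sketch below illustrates thresholding in general terms; the scores are invented, `features_above_threshold` is a hypothetical helper, and whether the package compares with `>` or `>=` is not something this tutorial shows.

```python
def features_above_threshold(scores, threshold=0.35):
    """Return (section_id, feature_id, score) triples whose score exceeds the threshold."""
    hits = []
    for section_id, feature_scores in scores.items():
        for feature_id, score in enumerate(feature_scores):
            if score > threshold:
                hits.append((section_id, feature_id, score))
    return hits

# Hypothetical per-section scores, one value per feature.
section_scores = {
    0: [0.80, 0.15, 0.05],   # one clear feature
    1: [0.40, 0.38, 0.22],   # two features pass
    2: [0.34, 0.33, 0.33],   # nothing passes
}

hits = features_above_threshold(section_scores, threshold=0.35)
for section_id, feature_id, score in hits:
    print(section_id, feature_id, score)
```

Raising the threshold trades recall for precision: with `threshold=0.5`, only the first section would still report a feature.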