# Using Latent Class Analysis to explore (synthetic) data

This notebook is about **Latent Class Analysis (LCA)**, which is a way to explore how Boolean attributes co-occur in a population. LCA is a classic technique that has been around since at least the 1980s.

I wanted to introduce LCA because:

 - I've found it useful in my work before.
 - Understanding it is a stepping stone to understanding more complicated models like Latent Dirichlet Allocation, a widely used technique for topic modelling (used e.g. by Converseon).
 

Here we will use the implementation of Latent Class Analysis from [dasirra/latent-class-analysis](https://github.com/dasirra/latent-class-analysis). The algorithm is about 2 or 3 screens of Python code, so it's not _rocket science_ ;-). It belongs to a general class of algorithms called _Expectation Maximisation_ (EM) algorithms. You can think of it as a little bit like k-means, but with soft (probabilistic) cluster assignments, if that helps.

In [1]:
from lca import LCA
import pandas as pd
from more_itertools import one

The imported function `one` is just a convenience for neatly getting the single item out of a one-element list or iterator, while simultaneously asserting (checking) that there is exactly one item in it.

In [2]:
one([31337])
# one([])         # raises ValueError: no items
# one([1, 2, 3])  # raises ValueError: more than one item

31337

We'll also want Plotly for drawing charts.

In [5]:
from plotly import __version__ as plotly_version
from plotly.offline import init_notebook_mode, iplot

print("Plotly version: " + plotly_version)

init_notebook_mode(connected=True)         # initiate notebook for offline plot

Plotly version: 3.5.0


## Two types of people, in a synthetic data set

I've generated a synthetic dataset of 1000 rows. Let's imagine each row is a person (as it will be when we look at some real data later). We have four binary attributes: A, B, C and D.

In [6]:
data_df = pd.read_csv("synth_data.csv")
data_df.head(10)

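|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 | 1 |
| 2 | 1 | 1 | 0 | 0 |
| 3 | 0 | 0 | 1 | 1 |
| 4 | 1 | 0 | 0 | 0 |
| 5 | 1 | 1 | 0 | 0 |
| 6 | 1 | 1 | 0 | 1 |
| 7 | 0 | 0 | 1 | 0 |
| 8 | 1 | 1 | 0 | 1 |
| 9 | 0 | 0 | 1 | 0 |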


We would like to know what types of people we generally have in the data, with respect to attributes A, B, C and D. Because there are only 4 attributes, we can count all the combinations:

In [7]:
data_df.groupby(['A', 'B', 'C', 'D']).size().to_frame('Count').reset_index().sort_values('Count', ascending=False)

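|    | A | B | C | D | Count |
|----|---|---|---|---|-------|
| 12 | 1 | 1 | 0 | 0 | 200 |
| 3  | 0 | 0 | 1 | 1 | 195 |
| 13 | 1 | 1 | 0 | 1 | 183 |
| 2  | 0 | 0 | 1 | 0 | 182 |
| 10 | 1 | 0 | 1 | 0 | 34 |
| 8  | 1 | 0 | 0 | 0 | 29 |
| 5  | 0 | 1 | 0 | 1 | 28 |
| 4  | 0 | 1 | 0 | 0 | 23 |
| 11 | 1 | 0 | 1 | 1 | 21 |
| 1  | 0 | 0 | 0 | 1 | 20 |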


We can see that we tend to have one of the following:
 - A and B, but not C
 - Neither A nor B, but C

The D attribute just does its own merry thing.
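To put a number on the claim about D, here is a small arithmetic check using only the four most common rows of the count table above. The helper name `d_rate` is mine, purely for illustration.

```python
# Counts copied from the four most common rows of the count table,
# keyed by (A, B, C, D).
top_counts = {
    (1, 1, 0, 0): 200,
    (0, 0, 1, 1): 195,
    (1, 1, 0, 1): 183,
    (0, 0, 1, 0): 182,
}


def d_rate(a, b, c):
    """Share of people with D=1 among those with the given A, B, C values."""
    with_d = top_counts[(a, b, c, 1)]
    without_d = top_counts[(a, b, c, 0)]
    return with_d / (with_d + without_d)


total = sum(top_counts.values())
print("Rows covered by these four combinations: %d of 1000" % total)
print("P(D | A, B, not C)     = %0.3f" % d_rate(1, 1, 0))
print("P(D | not A, not B, C) = %0.3f" % d_rate(0, 0, 1))
```

These four combinations cover 760 of the 1000 rows, and in both groups D is present close to half the time (183 of 383 and 195 of 377), which is what "does its own thing" means here.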

## Running LCA on the synthetic data

We can hand this data to LCA and ask it to find two classes, i.e. two types of people, in a way that _best explains the actual data observed_.

The idea is that the population is composed of a number of different classes of people, and people from different classes tend to have different attribute values. But we can't observe the classes of people directly (they are "Latent"); we can only infer their existence from the patterns of the attributes.
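One way to make this idea concrete is to write the story down as a tiny simulation: pick a hidden class for each person, then draw each attribute independently with a probability that depends on that class. This is only a sketch. The class weights and attribute probabilities below are hypothetical values I picked for illustration; they are not the ones used to generate `synth_data.csv`, and `sample_person` is not part of any library.

```python
import random

# HYPOTHETICAL parameters, chosen only for illustration.
CLASS_WEIGHTS = [0.5, 0.5]
ATTRIBUTE_PROBS = [
    {"A": 0.9, "B": 0.9, "C": 0.1, "D": 0.5},  # class 0
    {"A": 0.1, "B": 0.1, "C": 0.9, "D": 0.5},  # class 1
]


def sample_person(rng):
    """Draw a hidden class, then draw each attribute independently given it."""
    latent_class = rng.choices(range(len(CLASS_WEIGHTS)), weights=CLASS_WEIGHTS)[0]
    probs = ATTRIBUTE_PROBS[latent_class]
    person = {name: int(rng.random() < p) for name, p in probs.items()}
    return latent_class, person


rng = random.Random(42)  # fixed seed so the sketch is repeatable
samples = [sample_person(rng) for _ in range(1000)]

# In a real data set we only ever see the attributes, never the class.
people = [person for _, person in samples]
print(people[:3])
```

LCA works in the opposite direction: given only something like `people`, it tries to recover plausible class weights and attribute probabilities.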

We need to pass the data in as a NumPy array, which `data_df.values` gives us.

In [8]:
matrix = data_df.values

lca = LCA(n_components=2)
lca.fit(matrix)

print("Finished finding latent classes.")

Finished finding latent classes.
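For anyone curious what a `fit` call like this does conceptually, here is a minimal EM sketch in plain Python. It is not the code of the `lca` package: `fit_lca` and its parameters are names I made up, and the details (random start, fixed number of iterations, clipping) are my own assumptions. The input is rebuilt from the ten combinations listed in the count table earlier, which cover 915 of the 1000 rows, so the estimates will not exactly match the notebook's.

```python
import math
import random

# (A, B, C, D) -> count, copied from the combination count table.
COUNTS = {
    (1, 1, 0, 0): 200, (0, 0, 1, 1): 195, (1, 1, 0, 1): 183,
    (0, 0, 1, 0): 182, (1, 0, 1, 0): 34, (1, 0, 0, 0): 29,
    (0, 1, 0, 1): 28, (0, 1, 0, 0): 23, (1, 0, 1, 1): 21,
    (0, 0, 0, 1): 20,
}
rows = [combo for combo, n in COUNTS.items() for _ in range(n)]


def fit_lca(rows, n_classes, n_iter=100, seed=0):
    """Fit a latent class model to binary rows with plain EM.

    Returns (weights, theta): weights[k] is the estimated share of class k,
    theta[k][j] the estimated probability that attribute j is 1 in class k.
    """
    rng = random.Random(seed)
    n_attrs = len(rows[0])
    weights = [1.0 / n_classes] * n_classes
    # Random starting point, kept away from 0 and 1.
    theta = [[rng.uniform(0.25, 0.75) for _ in range(n_attrs)]
             for _ in range(n_classes)]

    for _ in range(n_iter):
        # E-step: how responsible is each class for each row (Bayes' rule)?
        resp = []
        for x in rows:
            joint = [
                weights[k] * math.prod(
                    theta[k][j] if x[j] else 1.0 - theta[k][j]
                    for j in range(n_attrs))
                for k in range(n_classes)
            ]
            total = sum(joint)
            resp.append([p / total for p in joint])

        # M-step: re-estimate parameters from responsibility-weighted counts.
        for k in range(n_classes):
            mass = sum(r[k] for r in resp)
            weights[k] = mass / len(rows)
            for j in range(n_attrs):
                hits = sum(r[k] * x[j] for r, x in zip(resp, rows))
                # Clip so no probability becomes exactly 0 or 1.
                theta[k][j] = min(max(hits / mass, 1e-6), 1.0 - 1e-6)
    return weights, theta


weights, theta = fit_lca(rows, n_classes=2)
for k in range(2):
    print("Class %d: weight %0.3f, theta %s"
          % (k, weights[k], ["%0.2f" % p for p in theta[k]]))
```

The k-means analogy shows up in the two steps: the E-step is like assigning points to clusters (but softly, with probabilities), and the M-step is like recomputing the cluster centres.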


The first thing we get as output from LCA is the probability of someone in each class having each of the attributes. These come from the `theta` attribute of the fitted LCA object, which is an array with a row for each latent class and 4 columns (one per attribute).

In [9]:
chart_data = []

for i in range(2):
    chart_data.append({
        'x': ['A', 'B', 'C', 'D'],
        'y': lca.theta[i, :],
        'type': 'bar',
        'name': "Class %d" % i
    })

figure = {
    'data': chart_data,
    'layout': {'yaxis': {'title': 'Probability of attribute'}}
}

iplot(figure)

This is **great**: LCA has told us there's:
 - One group of people who are very likely to have A, very likely to have B, very unlikely to have C, and have about an even chance of having D
 - Another group of people who are very unlikely to have A, very unlikely to have B, but very likely to have C, and have about an even chance of having D
 
LCA will also give us an estimate of the prevalence of each class in the data (in this case roughly half and half):

In [10]:
lca.weight

array([0.50715707, 0.49284293])
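It may help to see how the two outputs fit together. In the standard latent class model (written here in my own notation, as an assumption about what the library fits rather than something taken from its code), attributes are independent of each other once the class is known. Writing $\pi_k$ for the class prevalences (`weight`) and $\theta_{kj}$ for the probability of attribute $j$ in class $k$ (`theta`), the probability of seeing a particular pattern $x = (x_A, x_B, x_C, x_D)$ of 0s and 1s is:

$$
P(x) = \sum_{k} \pi_k \prod_{j} \theta_{kj}^{x_j} \, (1 - \theta_{kj})^{1 - x_j}
$$

and the probability that a person with pattern $x$ belongs to class $k$ follows from Bayes' rule:

$$
P(k \mid x) = \frac{\pi_k \prod_{j} \theta_{kj}^{x_j} \, (1 - \theta_{kj})^{1 - x_j}}{P(x)}
$$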

The next thing we can do is get the LCA model to tell us which class a particular person is likely to belong to. We can get a "hard assignment" (just one predicted class) or a "soft assignment" (a probability of belonging to each class).

In [11]:
person = data_df.loc[2]

print(person)

print("Hard assignment:")
print(lca.predict([person.values]))

print("Soft assignment:")
for i, prob in enumerate(one(lca.predict_proba([person.values]))):
    print("Probability for class %d: %0.3f" % (i, prob))

A    1
B    1
C    0
D    0
Name: 2, dtype: int64
Hard assignment:
[0]
Soft assignment:
Probability for class 0: 0.999
Probability for class 1: 0.001


In [12]:
person = data_df.loc[3]

print(person)

print("Hard assignment:")
print(lca.predict([person.values]))

print("Soft assignment:")
for i, prob in enumerate(one(lca.predict_proba([person.values]))):
    print("Probability for class %d: %0.3f" % (i, prob))

A    0
B    0
C    1
D    1
Name: 3, dtype: int64
Hard assignment:
[1]
Soft assignment:
Probability for class 0: 0.001
Probability for class 1: 0.999


In [13]:
person = data_df.loc[4]

print(person)

print("Hard assignment:")
print(lca.predict([person.values]))

print("Soft assignment:")
for i, prob in enumerate(one(lca.predict_proba([person.values]))):
    print("Probability for class %d: %0.3f" % (i, prob))

A    1
B    0
C    0
D    0
Name: 4, dtype: int64
Hard assignment:
[0]
Soft assignment:
Probability for class 0: 0.919
Probability for class 1: 0.081
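The soft assignments above can be understood as Bayes' rule applied to the fitted parameters. The sketch below works through that calculation by hand. The class weights are the ones printed by `lca.weight` earlier, but the `theta` values are hypothetical round numbers (the notebook only shows them as a chart), so the printed probabilities will not match the outputs above exactly. The functions `soft_assignment` and `hard_assignment` are my own names, not the library's API.

```python
import math

# Class prevalences as printed by lca.weight earlier in the notebook.
weights = [0.50715707, 0.49284293]

# HYPOTHETICAL per-class attribute probabilities, in the order A, B, C, D.
theta = [
    [0.9, 0.9, 0.1, 0.5],  # class 0
    [0.1, 0.1, 0.9, 0.5],  # class 1
]


def soft_assignment(x, weights, theta):
    """Posterior probability of each class given the 0/1 attributes x."""
    joint = [
        w * math.prod(p if xi else 1.0 - p for xi, p in zip(x, row))
        for w, row in zip(weights, theta)
    ]
    total = sum(joint)
    return [j / total for j in joint]


def hard_assignment(x, weights, theta):
    """Index of the most probable class."""
    probs = soft_assignment(x, weights, theta)
    return max(range(len(probs)), key=probs.__getitem__)


# The three attribute patterns looked at above (rows 2, 3 and 4).
for person in [(1, 1, 0, 0), (0, 0, 1, 1), (1, 0, 0, 0)]:
    probs = soft_assignment(person, weights, theta)
    print(person, "-> class", hard_assignment(person, weights, theta),
          ["%0.3f" % p for p in probs])
```

The third pattern has A but not B, so it matches neither class perfectly, which is why its assignment is less clear-cut than the first two.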
