# PyPremise Example: General Usage

PyPremise identifies the systematic differences between two groups of text. So let's start with two lists of texts, one about cats and one about dogs. In PyPremise, we call each text an instance.

In [1]:
group_0_texts = ["I like an independant cat .",
                 "The cat I have is independant .",
                 "That's a cute cat .",
                 "Wow, your cat is really independant .",
                 "All voted for the independant cat .",
                 "I have a very independant cat .",
                 "Who knew that a cat would be so independant .",
                 "I like that animal that catches mice .",
                 "I like the cat because it is independant .",
                 "Your cat is so independant and so cute .",
                 "I have a very independant cat .",
                 "Well, independant is probably the right adjective to describe this cat ."]

group_1_texts = ["My dog is incredibly loyal .",
                 "A dog is famously a man's best friend .",
                 "The dog ate my homework .",
                 "The dog likes to chase the rabbit .",
                 "Wow, your dog is really loyal and cute .",
                 "The dog had a beautiful color .",
                 "Being very loyal, the dog helped her owner all the time .",
                 "I'm a big fan of this dog ."]

You might have noticed that the final "." of each text is separated from the preceding word by a space. PyPremise assumes that your data is already tokenized, i.e. split into individual words or tokens. In this tutorial, we keep things simple and just split on whitespace. Note that this leaves other punctuation attached to its word, e.g. "Wow," stays a single token.

For more advanced tokenization, you can use a library like [spaCy](https://spacy.io/api/tokenizer) to perform the tokenization. Usually, you should tokenize your texts before giving them to PyPremise as this reduces the vocabulary and makes it more likely to find patterns.
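
If you do not want to pull in an extra dependency, a small regular expression already goes a step beyond whitespace splitting. The following is only a minimal sketch using Python's standard library: the name `regex_tokenizer`, the pattern, and the optional lowercasing are illustrative choices for this tutorial and are not part of PyPremise.

```python
import re

# A token is either a word (keeping internal apostrophes, as in "That's")
# or a single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")

def regex_tokenizer(texts, lowercase=True):
    """Split each text into word and punctuation tokens.

    Hypothetical helper for illustration; not part of PyPremise.
    """
    token_lists = []
    for text in texts:
        if lowercase:
            # Lowercasing merges e.g. "The" and "the" into one vocabulary entry.
            text = text.lower()
        token_lists.append(TOKEN_PATTERN.findall(text))
    return token_lists

print(regex_tokenizer(["Wow, your cat is really independant ."]))
# [['wow', ',', 'your', 'cat', 'is', 'really', 'independant', '.']]
```

Compared to plain whitespace splitting, "Wow," becomes the two tokens "wow" and ",", so the word ends up in the same vocabulary entry whether or not a comma follows it.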

In [2]:
# simple whitespace tokenizer
def tokenizer(texts):
    return [text.split() for text in texts]

tok_group_0_texts = tokenizer(group_0_texts)
tok_group_1_texts = tokenizer(group_1_texts)
print(tok_group_0_texts)

[['I', 'like', 'an', 'independant', 'cat', '.'], ['The', 'cat', 'I', 'have', 'is', 'independant', '.'], ["That's", 'a', 'cute', 'cat', '.'], ['Wow,', 'your', 'cat', 'is', 'really', 'independant', '.'], ['All', 'voted', 'for', 'the', 'independant', 'cat', '.'], ['I', 'have', 'a', 'very', 'independant', 'cat', '.'], ['Who', 'knew', 'that', 'a', 'cat', 'would', 'be', 'so', 'independant', '.'], ['I', 'like', 'that', 'animal', 'that', 'catches', 'mice', '.'], ['I', 'like', 'the', 'cat', 'because', 'it', 'is', 'independant', '.'], ['Your', 'cat', 'is', 'so', 'independant', 'and', 'so', 'cute', '.'], ['I', 'have', 'a', 'very', 'independant', 'cat', '.'], ['Well,', 'independant', 'is', 'probably', 'the', 'right', 'adjective', 'to', 'describe', 'this', 'cat', '.']]


Now we convert our lists of tokenized texts into the format that PyPremise internally works with. 

If you are interested in the details: PyPremise makes a bag-of-words assumption and uses a numeric representation in which each token is represented by its vocabulary index. If that sounds confusing, don't worry. The data_loaders module provides a helper function that takes care of it:

In [3]:
from pypremise import data_loaders
premise_instances, voc_token_to_index, voc_index_to_token = data_loaders.from_token_lists(tok_group_0_texts, tok_group_1_texts)
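
To make the idea of a vocabulary index more concrete, here is a rough pure-Python sketch of such a representation. It is not PyPremise's actual implementation: the function names are made up for this illustration, and treating each instance as a set of indices (so that word order and repeated words are dropped) is an assumption about what "bag of words" means here.

```python
def build_vocabulary(token_lists):
    """Assign each distinct token an integer index in order of first appearance."""
    token_to_index = {}
    for tokens in token_lists:
        for token in tokens:
            # setdefault only inserts the token if it is not yet known.
            token_to_index.setdefault(token, len(token_to_index))
    index_to_token = {index: token for token, index in token_to_index.items()}
    return token_to_index, index_to_token

def to_index_bag(tokens, token_to_index):
    """Represent one instance as a sorted list of unique vocabulary indices."""
    return sorted({token_to_index[token] for token in tokens})

# Two toy instances taken from the example data above.
toy_texts = [["I", "like", "that", "animal", "that", "catches", "mice", "."],
             ["The", "dog", "ate", "my", "homework", "."]]

token_to_index, index_to_token = build_vocabulary(toy_texts)
bags = [to_index_bag(tokens, token_to_index) for tokens in toy_texts]
print(bags)
# [[0, 1, 2, 3, 4, 5, 6], [6, 7, 8, 9, 10, 11]]
```

The word "that" occurs twice in the first instance but contributes only one index (2), and the "." shared by both instances maps to the same index (6).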

Now let's run the Premise algorithm and identify the differences between the two groups of texts!

In [4]:
from pypremise import Premise
premise = Premise(voc_index_to_token=voc_index_to_token)

patterns = premise.find_patterns(premise_instances)
for p in patterns:
    print(p)

(dog) towards group 1 (Instances: 0 in group 0, 8 in group 1)
(independant) and (cat) towards group 0 (Instances: 10 in group 0, 0 in group 1)
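
The instance counts in this output can be checked by hand. The sketch below counts how many tokenized instances contain every token of a pattern, reading "(independant) and (cat)" as "both tokens occur in the instance", which is an interpretation consistent with the counts printed above. The helper `count_matches` is hypothetical and not part of PyPremise, and the block uses a shortened copy of the example data so that it runs on its own.

```python
def count_matches(token_lists, pattern_tokens):
    """Count the instances that contain all tokens of the pattern."""
    required = set(pattern_tokens)
    return sum(1 for tokens in token_lists if required.issubset(tokens))

# Shortened copy of the example data, tokenized by splitting on whitespace.
sample_group_0 = [text.split() for text in [
    "I like an independant cat .",
    "That's a cute cat .",
    "I like that animal that catches mice .",
]]
sample_group_1 = [text.split() for text in [
    "My dog is incredibly loyal .",
    "The dog ate my homework .",
]]

for pattern in (["independant", "cat"], ["dog"]):
    in_group_0 = count_matches(sample_group_0, pattern)
    in_group_1 = count_matches(sample_group_1, pattern)
    print(pattern, "->", in_group_0, "in group 0,", in_group_1, "in group 1")
# ['independant', 'cat'] -> 1 in group 0, 0 in group 1
# ['dog'] -> 0 in group 0, 2 in group 1
```

Calling the same function on the full tok_group_0_texts and tok_group_1_texts from above gives 10 and 0 for independant plus cat, and 0 and 8 for dog, matching the counts PyPremise reports.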


PyPremise uses a data mining approach and needs a certain amount of data to find statistically significant results, which is why the example contains so many individual sentences about cats and dogs. Feel free to experiment by adding or removing sentences. How many examples PyPremise needs to identify a pattern depends on many factors, including noise (e.g. a "cat" also appearing in the dog group, or a cat text such as "I like that animal that catches mice ." that never uses the word "cat"). If PyPremise returns no patterns, try increasing the number of instances. As a rule of thumb, 10 to 20 instances per group is the bare minimum, and results start to get interesting with 100 or more instances.