# Synthetic Text Generation

In this notebook, we demonstrate how to synthesize free-text columns and then explore the quality of the generated text.

For further background see also [this blog post](https://mostly.ai/blog/synthetic-data-for-text-annotation/) on "How To Scale Up Your Text Annotation Initiatives with Synthetic Text".

## Synthesize Data via MOSTLY AI

1. Download `london.csv` by clicking [here](https://github.com/mostly-ai/mostly-tutorials/raw/dev/synthetic-text/london.csv), and pressing Ctrl+S to save the file locally.

2. Synthesize `london.csv` via [MOSTLY AI](https://mostly.ai/), and configure `host_name` and `title` with the Encoding Type `Text`.

3. Once the job has finished, which might take up to 1 hour, download the generated synthetic data as a CSV file to your computer.

4. Upload the generated synthetic data to this notebook by executing the next cell.

In [None]:
# upload synthetic dataset
import pandas as pd
try:
    # check whether we are in Google colab
    from google.colab import files
    print("running in COLAB mode")
    repo = 'https://github.com/mostly-ai/mostly-tutorials/raw/dev/synthetic-text'
    import io
    uploaded = files.upload()
    syn = pd.read_csv(io.BytesIO(list(uploaded.values())[0]))
    print(f"uploaded synthetic data with {syn.shape[0]:,} records and {syn.shape[1]:,} attributes")
except ImportError:
    # google.colab is not available, so we are running locally
    print("running in LOCAL mode")
    repo = '.'
    print("adapt `syn_file_path` to point to your generated synthetic data file")
    syn_file_path = './london-synthetic-representative.csv'
    syn = pd.read_csv(syn_file_path)
    print(f"read synthetic data with {syn.shape[0]:,} records and {syn.shape[1]:,} attributes")
    
tgt = pd.read_csv(f'{repo}/london.csv')
print(f"read original data with {tgt.shape[0]:,} records and {tgt.shape[1]:,} attributes")

## Explore Synthetic Text

Show 10 randomly sampled synthetic records. Note that you can execute the following cell multiple times to see different samples.

In [None]:
syn.sample(n=10)

Compare this to 10 randomly sampled original records.

In [None]:
tgt.sample(n=10)

### Inspect Character Set

You will notice that the character set of the synthetic data is smaller. This is due to the privacy mechanism within the MOSTLY AI platform, which removes very rare tokens so that their presence cannot give away information about the existence of individual records.

In [None]:
print('## ORIGINAL ##\n', ''.join(sorted(list(set(tgt['title'].str.cat(sep=' '))))), '\n')
print('## SYNTHETIC ##\n', ''.join(sorted(list(set(syn['title'].str.cat(sep=' '))))), '\n')
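
To see exactly which characters were dropped, a set difference is enough. The following is a minimal sketch: the helpers `char_set` and `missing_chars` and the toy titles are illustrative assumptions, not part of the notebook or its data.

```python
def char_set(titles):
    """Return the distinct characters across all titles, skipping non-strings such as NaN."""
    return set().union(*(set(t) for t in titles if isinstance(t, str)))

def missing_chars(original_titles, synthetic_titles):
    """Characters that occur in the original titles but not in the synthetic ones."""
    return sorted(char_set(original_titles) - char_set(synthetic_titles))

# toy data for illustration only
toy_original = ["Cosy flat ☀", "Double room, zone 2", None]
toy_synthetic = ["Cosy double room", "Flat, zone 2"]
dropped = missing_chars(toy_original, toy_synthetic)
print(dropped)
```

With the notebook's data, the same helper could be called as `missing_chars(tgt['title'], syn['title'])`.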

### Inspect Character Frequency

In [None]:
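# name the index explicitly, so that the merge key does not depend on the pandas version
title_char_freq = pd.merge(
    tgt['title'].str.split('').explode().value_counts(normalize=True).rename_axis('char').to_frame('tgt').reset_index(),
    syn['title'].str.split('').explode().value_counts(normalize=True).rename_axis('char').to_frame('syn').reset_index(),
    on='char',
    how='outer'
).round(5)
# order by frequency in the original data, most frequent characters first
title_char_freq = title_char_freq.sort_values('tgt', ascending=False).reset_index(drop=True)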
title_char_freq.head(10)

In [None]:
import matplotlib.pyplot as plt
ax = title_char_freq.head(100).plot.line()
plt.title('Distribution of Char Frequencies')
plt.show()

We can see that character frequencies are closely retained.
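
A visual comparison can be complemented with a single number. One option, which is not used in the notebook itself, is the total variation distance between the two character distributions: 0 means identical distributions, 1 means no overlap at all. The sketch below uses only the standard library; the function names and the toy inputs are assumptions made for illustration.

```python
from collections import Counter

def char_distribution(titles):
    """Relative frequency of each character across all titles (non-strings are skipped)."""
    counts = Counter(ch for t in titles if isinstance(t, str) for ch in t)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: c / total for ch, c in counts.items()}

def total_variation_distance(p, q):
    """Half the sum of absolute differences between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# toy data for illustration only
p = char_distribution(["aab"])
q = char_distribution(["abb"])
tvd = total_variation_distance(p, q)
print(f"total variation distance: {tvd:.3f}")
```

Applied to the notebook's data, this would be `total_variation_distance(char_distribution(tgt['title']), char_distribution(syn['title']))`.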

### Inspect Term Frequency

In [None]:
import re
def sanitize(s):
    # lowercase, replace common punctuation with spaces, and collapse repeated spaces
    s = str(s).lower()
    s = re.sub(r'[,.()!":/]', ' ', s)
    s = re.sub(' +', ' ', s)
    return s

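tgt['terms'] = tgt['title'].apply(sanitize).str.split(' ')
syn['terms'] = syn['title'].apply(sanitize).str.split(' ')

# name the index explicitly, so that the merge key does not depend on the pandas version
title_term_freq = pd.merge(
    tgt['terms'].explode().value_counts(normalize=True).rename_axis('term').to_frame('tgt').reset_index(),
    syn['terms'].explode().value_counts(normalize=True).rename_axis('term').to_frame('syn').reset_index(),
    on='term',
    how='outer'
).round(5)
# order by frequency in the original data, most frequent terms first
title_term_freq = title_term_freq.sort_values('tgt', ascending=False).reset_index(drop=True)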
display(title_term_freq.head(10))
display(title_term_freq.head(200).tail(10))

In [None]:
ax = title_term_freq.head(100).plot.line()
plt.title('Distribution of Term Frequencies')
plt.show()

We can see that term frequencies are closely retained.

### Inspect Term Co-occurrence

In [None]:
def calc_conditional_probability(term1, term2):
    # keep only titles that contain `term1` (case-insensitive; missing titles never match)
    tgt_with_term1 = tgt['title'][tgt['title'].str.lower().str.contains(term1, na=False)]
    syn_with_term1 = syn['title'][syn['title'].str.lower().str.contains(term1, na=False)]
    # share of those titles that also contain `term2`
    tgt_share = tgt_with_term1.str.lower().str.contains(term2).mean()
    syn_share = syn_with_term1.str.lower().str.contains(term2).mean()
    print(f"{tgt_share:.0%} of actual listings that contain `{term1}` also contain `{term2}`")
    print(f"{syn_share:.0%} of synthetic listings that contain `{term1}` also contain `{term2}`")
    print("")

calc_conditional_probability('bed', 'double')
calc_conditional_probability('bed', 'king')
calc_conditional_probability('heart', 'london')
calc_conditional_probability('london', 'heart')

We can see that term co-occurrences are closely retained.

Now you might be asking yourself: if all of these characteristics are maintained, what are the chances that we'll end up with exact matches, i.e. synthetic records with the exact same `title` value as a record in the original dataset? Or even a synthetic record with the exact same values for all the columns?

Let's start by trying to find an exact match for 1 specific synthetic `title` value:

In [None]:
# Find an exact match for one specific synthetic title value.
# Copy a `title` value from a synthetic record into `title_value` below,
# then run the cell to look for the same title (ignoring case) in the original dataset.
title_value = "Airy large double room"
tgt.loc[tgt['title'].str.lower() == title_value.lower()]

Depending on your chosen value, you may or may not find an exact match. This row-by-row validation process doesn't indicate very much and, more importantly, doesn't scale very well to the 71K rows in the dataset.

### Inspect Privacy via Exact Matches

Let's perform a simplified check for privacy by looking for exact matches between the synthetic and the original data.

For that we first split the original data into two equally-sized sets, and measure the number of matches between those two sets.

In [None]:
n = int(tgt.shape[0]/2)
pd.merge(tgt[['title']][:n].drop_duplicates(), tgt[['title']][n:].drop_duplicates())

Next, we take an equally-sized subset of the synthetic data, and again measure the number of matches between that set and the original data.

In [None]:
pd.merge(tgt[['title']][:n].drop_duplicates(), syn[['title']][:n].drop_duplicates())

We can see that exact matches between original and synthetic data can occur. However, they occur only for the most commonly used descriptions, and they do not occur more often than they occur in the original data itself.

Thus, it's important to note that matching values or matching complete records are not by themselves a sign of a privacy leak. They are only an issue if they occur more frequently than we would expect based on the original dataset. Also note that removing those exact matches via post-processing would be counterproductive: the absence of a value like "Lovely single room" in a sufficiently large synthetic text corpus would actually give away the fact that this sentence was present in the original. See [[1](#refs)] and [[2](#refs)] for more background on this aspect.
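
The comparison against a holdout baseline can also be expressed as a rate rather than a list of matches. The following sketch is one possible way to do that: `match_rate` is a hypothetical helper, and the titles are made-up toy data, so the printed numbers say nothing about the actual dataset.

```python
def match_rate(reference_titles, candidate_titles):
    """Share of distinct candidate titles that also appear among the reference titles."""
    reference = {t for t in reference_titles if isinstance(t, str)}
    candidate = {t for t in candidate_titles if isinstance(t, str)}
    if not candidate:
        return 0.0
    return len(candidate & reference) / len(candidate)

# toy data for illustration only
toy_original = ["Lovely single room", "Cosy flat", "Lovely single room",
                "Bright studio", "Garden flat", "Lovely single room"]
toy_synthetic = ["Lovely single room", "Sunny loft", "Quiet double room"]

half = len(toy_original) // 2
first_half, second_half = toy_original[:half], toy_original[half:]

# baseline: how often does one half of the original data match the other half?
baseline_rate = match_rate(first_half, second_half)
# synthetic: how often does an equally sized synthetic sample match that same half?
synthetic_rate = match_rate(first_half, toy_synthetic[:half])
print(f"baseline: {baseline_rate:.0%}, synthetic: {synthetic_rate:.0%}")
```

A synthetic rate that is clearly above the baseline rate would be a reason to look closer; a rate at or below the baseline is what we would expect from a privacy-preserving generator.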

### Analyze Price vs. Text Correlation

In [None]:
tgt_term_price = tgt[['terms', 'price']].explode(column='terms').groupby('terms')['price'].median()
syn_term_price = syn[['terms', 'price']].explode(column='terms').groupby('terms')['price'].median()
def print_term_price(term):
    print(f"Median Price of actual Listings, that contain `{term}`: ${tgt_term_price[term]:.0f}")
    print(f"Median Price of synthetic Listings, that contain `{term}`: ${syn_term_price[term]:.0f}")
    print("")

print_term_price("luxury")
print_term_price("stylish")
print_term_price("cozy")
print_term_price("small")

We can see that correlations between term occurrence and the price per night are also closely retained.
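
Rather than checking a handful of hand-picked terms, one could rank all shared terms by how far their median prices differ. The sketch below is only an illustration under simplifying assumptions: it tokenizes on whitespace instead of using the `sanitize` function, counts each listing once per distinct term, and uses hypothetical helper names and toy data.

```python
from collections import defaultdict
from statistics import median

def median_price_by_term(titles, prices):
    """Median price per lowercase term; each listing counts once per distinct term."""
    by_term = defaultdict(list)
    for title, price in zip(titles, prices):
        if not isinstance(title, str):
            continue  # skip missing titles
        for term in set(title.lower().split()):
            by_term[term].append(price)
    return {term: median(values) for term, values in by_term.items()}

def largest_gaps(original, synthetic, top=3):
    """Shared terms with the largest absolute difference in median price."""
    shared = set(original) & set(synthetic)
    return sorted(shared, key=lambda t: abs(original[t] - synthetic[t]), reverse=True)[:top]

# toy data for illustration only
toy_tgt = median_price_by_term(["Luxury flat", "Luxury loft", "Small room"], [200, 300, 40])
toy_syn = median_price_by_term(["Luxury flat", "Small room", "Small flat"], [220, 45, 60])
gaps = largest_gaps(toy_tgt, toy_syn)
print(gaps)
```

With the notebook's data, the inputs would be `tgt['title']` with `tgt['price']`, and `syn['title']` with `syn['price']`.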

## Conclusion

This tutorial demonstrated how synthetic text can be generated within the context of an otherwise structured dataset. We analyzed the generated texts and validated that characters and terms occur with closely matching frequencies, while exact matches are no more likely than within the original text itself.

This feature thus makes it possible to retain valuable statistical insights that are typically buried in free-text columns and would otherwise remain inaccessible due to their privacy-sensitive nature.

## Further exercises

In addition to walking through the above instructions, we suggest:
* analyzing further correlations, also for `host_name`
* using a different generation mood, e.g. conservative sampling
* using a different dataset, e.g. the Austrian first names dataset [[3](#refs)]

## References<a class="anchor" name="refs"></a>

1. https://www.frontiersin.org/articles/10.3389/fdata.2021.679939/full
1. https://mostly.ai/blog/truly-anonymous-synthetic-data-legal-definitions-part-ii/
1. https://github.com/mostly-ai/public-demo-data/blob/dev/firstnames_at/firstnames_at.csv.gz