<a href="https://colab.research.google.com/github/olonok69/LLM_Notebooks/blob/main/langchain/use_cases/Langchain_OpenAI_Use_cases_Syntethic_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LangChain

LangChain is a framework for developing applications powered by language models.

https://python.langchain.com/docs/use_cases

## LangChain Synthetic Data
Synthetic data is artificially generated data, rather than data collected from real-world events. It's used to simulate real data without compromising privacy or encountering real-world limitations.

Benefits of Synthetic Data:

- Privacy and Security: No real personal data at risk of breaches.
- Data Augmentation: Expands datasets for machine learning.
- Flexibility: Create specific or rare scenarios.
- Cost-effective: Often cheaper than real-world data collection.
- Regulatory Compliance: Helps navigate strict data protection laws.
- Model Robustness: Can lead to better generalizing AI models.
- Rapid Prototyping: Enables quick testing without real data.
- Controlled Experimentation: Simulate specific conditions.
- Access to Data: Alternative when real data isn't available.

https://python.langchain.com/docs/get_started/introduction

https://python.langchain.com/docs/use_cases/data_generation

https://python.langchain.com/docs/modules/model_io/prompts/

https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_text_splitter

https://api.python.langchain.com/en/latest/experimental_api_reference.html

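## Presidio
Presidio is Microsoft's open-source SDK for detecting and anonymizing personally identifiable information (PII) in text. Its default NLP engine is built on spaCy, which is why the `en_core_web_lg` model is downloaded below.

https://microsoft.github.io/presidio/

https://spacy.io/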

In [1]:
! pip install langchain langchain-community tiktoken -q
! pip install -U accelerate -q
! pip install -U unstructured numpy -q
! pip install openai chromadb beautifulsoup4 -q
! pip install presidio_analyzer presidio_anonymizer
! python -m spacy download en_core_web_lg

In [6]:
! pip install langchain_experimental langchain-openai -q

In [2]:

from google.colab import output
output.enable_custom_widget_manager()

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
from google.colab import userdata
openai_api_key = userdata.get('KEY_OPENAI')


In [12]:
from langchain.prompts import FewShotPromptTemplate, PromptTemplate
from langchain_core.pydantic_v1 import BaseModel
from langchain_experimental.tabular_synthetic_data.openai import (
    OPENAI_TEMPLATE,
    create_openai_data_generator,
)
from langchain_experimental.tabular_synthetic_data.prompts import (
    SYNTHETIC_FEW_SHOT_PREFIX,
    SYNTHETIC_FEW_SHOT_SUFFIX,
)
from langchain_openai import ChatOpenAI

In [14]:
class PII_entities(BaseModel):
    PERSON: str
    LOCATION: str
    CREDIT_CARD: str
    EMAIL_ADDRESS: str
    IP_ADDRESS: str
    IBAN_CODE: str

In [16]:
# Few-shot examples. Adjacent string literals are concatenated, so no stray
# newlines or indentation end up inside the prompt text.
examples = [
    {
        "example": (
            "PERSON: John Wick, LOCATION: New York, CREDIT_CARD: 4095-2609-9393-4932, "
            "EMAIL_ADDRESS: lapalma@google.com, IP_ADDRESS: 172.110.1.2, "
            "IBAN_CODE: IL150120690000003111111"
        )
    },
    {
        "example": (
            "PERSON: Emma Watson, LOCATION: London, CREDIT_CARD: 2839 8495 7764 6377, "
            "EMAIL_ADDRESS: ewatson@yahoo.com, IP_ADDRESS: 192.168.2.18, "
            "IBAN_CODE: BE85 0126 5388 8999"
        )
    },
    {
        "example": (
            "PERSON: Carlos Espinosa Wick, LOCATION: Mexico Df, CREDIT_CARD: 2539 3519 2345 1555, "
            "EMAIL_ADDRESS: carlosespinosa@outlook.com, IP_ADDRESS: 10.190.1.20, "
            "IBAN_CODE: GB88STBU122345456789107"
        )
    },
]

# Prompt Template

https://python.langchain.com/docs/modules/model_io/prompts/few_shot_examples

https://api.python.langchain.com/en/latest/tabular_synthetic_data/langchain_experimental.tabular_synthetic_data.openai.create_openai_data_generator.html



In [17]:
# Note: this rebinds OPENAI_TEMPLATE, shadowing the object imported above.
OPENAI_TEMPLATE = PromptTemplate(input_variables=["example"], template="{example}")

prompt_template = FewShotPromptTemplate(
    prefix=SYNTHETIC_FEW_SHOT_PREFIX,
    examples=examples,
    suffix=SYNTHETIC_FEW_SHOT_SUFFIX,
    input_variables=["subject", "extra"],
    example_prompt=OPENAI_TEMPLATE,
)
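
To make the structure of the few-shot prompt easier to see, here is a minimal plain-Python sketch of how a prefix, a list of formatted examples, and a suffix can be joined into a single prompt string. It does not call LangChain. `PREFIX`, `SUFFIX`, and `build_few_shot_prompt` are hypothetical placeholders written for this illustration, not the actual `SYNTHETIC_FEW_SHOT_PREFIX` / `SYNTHETIC_FEW_SHOT_SUFFIX` strings, and the blank-line separator is an assumption about a sensible default.

```python
# Plain-Python sketch of few-shot prompt assembly (no LangChain required).
# PREFIX and SUFFIX are hypothetical placeholders, not the library's strings.
PREFIX = "Generate synthetic data about {subject}. Examples below:"
SUFFIX = "Now generate new synthetic data about {subject}. Make sure to {extra}:"


def build_few_shot_prompt(examples, subject, extra, separator="\n\n"):
    """Join the prefix, each formatted example, and the suffix into one string."""
    example_template = "{example}"  # mirrors the single-variable example prompt
    parts = [PREFIX.format(subject=subject)]
    parts += [example_template.format(**ex) for ex in examples]
    parts.append(SUFFIX.format(subject=subject, extra=extra))
    return separator.join(parts)


demo_examples = [
    {"example": "PERSON: John Wick, LOCATION: New York"},
    {"example": "PERSON: Carlos Espinosa Wick, LOCATION: Mexico Df"},
]
prompt_text = build_few_shot_prompt(
    demo_examples, subject="PII entities", extra="choose names at random"
)
print(prompt_text)
```

The `subject` and `extra` arguments play the same role as the `input_variables` declared on the template: they are filled in at generation time, while the examples stay fixed.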

In [18]:
synthetic_data_generator = create_openai_data_generator(
    output_schema=PII_entities,
    llm=ChatOpenAI(
        model="gpt-3.5-turbo", temperature=1, openai_api_key=openai_api_key
    ),  # You'll need to replace with your actual Language Model instance
    prompt=prompt_template,
)

In [19]:
synthetic_results = synthetic_data_generator.generate(
    subject="PII entities",
    extra="""the name must be chosen at random. Make it something you wouldn't normally choose. Credit Card numbers must indicate the provider:
    Mastercard numbers start with a 2 or 5. Visa card numbers start with a 4. American Express numbers start with a 3. """,
    runs=10,
)

In [35]:
synthetic_results

[PII_entities(PERSON='Jane Smith', LOCATION='Los Angeles', CREDIT_CARD='5273-8436-9291-5482', EMAIL_ADDRESS='janesmith@example.com', IP_ADDRESS='124.56.78.90', IBAN_CODE='DE89370400440532013000'),
 PII_entities(PERSON='Samantha Rodriguez', LOCATION='New York', CREDIT_CARD='4532 7845 3928 1167', EMAIL_ADDRESS='srodriguez@gmail.com', IP_ADDRESS='203.145.22.77', IBAN_CODE='FR76 3000 9000 0111'),
 PII_entities(PERSON='Elliot Green', LOCATION='Paris', CREDIT_CARD='4152 7430 1826 9963', EMAIL_ADDRESS='elliot.green@example.com', IP_ADDRESS='192.168.1.10', IBAN_CODE='GB19LOYD91492874915682'),
 PII_entities(PERSON='Alice Johnson', LOCATION='Tokyo', CREDIT_CARD='3126 7464 9182 4837', EMAIL_ADDRESS='alice.johnson@example.com', IP_ADDRESS='210.45.67.89', IBAN_CODE='ES66378654530482917344'),
 PII_entities(PERSON='John Smith', LOCATION='Los Angeles', CREDIT_CARD='5156 7890 2345 6789', EMAIL_ADDRESS='john.smith@example.com', IP_ADDRESS='172.16.254.1', IBAN_CODE='DE89370400440532013000'),
 PII_entitie
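
The `extra` instruction asks the model to pick card numbers whose first digit indicates the provider. A quick sanity check on the generated values might look like the sketch below. It applies the first-digit rule from the prompt and the standard Luhn checksum. The helper names `card_provider` and `luhn_valid` are hypothetical, and the sample numbers are copied from the output above. Numbers invented by a language model are not guaranteed to pass the checksum, which may matter for detectors that validate it.

```python
def card_provider(number: str) -> str:
    """Map the first digit to a provider, following the rule given in the prompt."""
    digits = "".join(ch for ch in number if ch.isdigit())
    if not digits:
        return "unknown"
    providers = {
        "2": "Mastercard",
        "5": "Mastercard",
        "4": "Visa",
        "3": "American Express",
    }
    return providers.get(digits[0], "unknown")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


# Sample values copied from the generated records shown above.
for card in ["5273-8436-9291-5482", "4532 7845 3928 1167", "3126 7464 9182 4837"]:
    print(card, card_provider(card), luhn_valid(card))
```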

# Create Dataset for Synthetic data


https://api.python.langchain.com/en/latest/tabular_synthetic_data/langchain_experimental.tabular_synthetic_data.base.SyntheticDataGenerator.html



In [46]:
from langchain_experimental.synthetic_data import DatasetGenerator

# LLM used to turn each structured record into free text
model = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.7, openai_api_key=openai_api_key)


In [36]:
# Convert each generated Pydantic object into a plain dict for DatasetGenerator
imp = []
for r in synthetic_results:
    data = {
        "PERSON": r.PERSON,
        "LOCATION": r.LOCATION,
        "CREDIT_CARD": r.CREDIT_CARD,
        "EMAIL_ADDRESS": r.EMAIL_ADDRESS,
        "IP_ADDRESS": r.IP_ADDRESS,
        "IBAN_CODE": r.IBAN_CODE,
    }
    imp.append(data)

In [38]:
# Last input record that will be passed to the dataset generator
imp[-1]

{'PERSON': 'Astrid Montgomery',
 'LOCATION': 'Sydney',
 'CREDIT_CARD': '4716 2890 5678 1234 (Visa)',
 'EMAIL_ADDRESS': 'astrid.montgomery@example.com',
 'IP_ADDRESS': '210.56.78.90',
 'IBAN_CODE': 'AU54321987654321098765'}

In [42]:
generator = DatasetGenerator(model, {"style": "formal", "minimal length": 500})
dataset = generator(imp)

In [44]:
dataset[-1]

{'fields': {'PERSON': 'Astrid Montgomery',
  'LOCATION': 'Sydney',
  'CREDIT_CARD': '4716 2890 5678 1234 (Visa)',
  'EMAIL_ADDRESS': 'astrid.montgomery@example.com',
  'IP_ADDRESS': '210.56.78.90',
  'IBAN_CODE': 'AU54321987654321098765'},
 'preferences': {'style': 'formal', 'minimal length': 500},
 'text': 'Ms. Astrid Montgomery, residing in the vibrant city of Sydney, can be reached at her formal email address, astrid.montgomery@example.com. For any financial transactions, her Visa credit card number is 4716 2890 5678 1234, and her IBAN code is AU54321987654321098765. Additionally, her IP address is 210.56.78.90.'}

In [78]:
dataset[-1]['text']

'Ms. Astrid Montgomery, residing in the vibrant city of Sydney, can be reached at her formal email address, astrid.montgomery@example.com. For any financial transactions, her Visa credit card number is 4716 2890 5678 1234, and her IBAN code is AU54321987654321098765. Additionally, her IP address is 210.56.78.90.'

# Extraction with Output Parsers

https://python.langchain.com/docs/modules/model_io/output_parsers/types/pydantic


In [47]:
# Parsers

In [64]:
from langchain.output_parsers import PydanticOutputParser

from langchain_openai import OpenAI

In [70]:
llm = OpenAI(model_name="gpt-3.5-turbo-instruct",openai_api_key=openai_api_key)

In [71]:

parser = PydanticOutputParser(pydantic_object=PII_entities)

prompt = PromptTemplate(
    template="Extract fields from a given text.\n{format_instructions}\n{text}\n",
    input_variables=["text"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

In [77]:
# Fill the prompt with the first generated text, query the LLM, and parse the reply
_input = prompt.format_prompt(text=dataset[0]["text"])
output = llm.invoke(_input.to_string())
parsed = parser.parse(output)
print(parsed)
print(dataset[0]["text"])

PERSON='Jane Smith' LOCATION='Los Angeles' CREDIT_CARD='5273-8436-9291-5482' EMAIL_ADDRESS='janesmith@example.com' IP_ADDRESS='124.56.78.90' IBAN_CODE='DE89370400440532013000'
Ms. Jane Smith, a resident of Los Angeles, can be reached at her email address, janesmith@example.com. She can also be reached via her IP address, 124.56.78.90. Additionally, her credit card number is 5273-8436-9291-5482 and her IBAN code is DE89370400440532013000.
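
Because the text was generated from known field values, the extraction can be checked against them. The following is a minimal sketch of a field-by-field comparison, assuming exact string equality is an acceptable criterion. `field_accuracy` is a hypothetical helper, and the `expected` values are copied from the first generated record shown earlier.

```python
def field_accuracy(expected: dict, extracted: dict) -> float:
    """Fraction of expected fields whose extracted value matches exactly."""
    if not expected:
        return 0.0
    hits = sum(1 for key, value in expected.items() if extracted.get(key) == value)
    return hits / len(expected)


# Values copied from the first synthetic record in this notebook.
expected = {
    "PERSON": "Jane Smith",
    "LOCATION": "Los Angeles",
    "CREDIT_CARD": "5273-8436-9291-5482",
    "EMAIL_ADDRESS": "janesmith@example.com",
    "IP_ADDRESS": "124.56.78.90",
    "IBAN_CODE": "DE89370400440532013000",
}
# Stand-in for the parser output; here it simply mirrors the expected values.
extracted = dict(expected)

print(field_accuracy(expected, extracted))
```

In the notebook itself, `expected` would correspond to `dataset[0]["fields"]` and `extracted` to `parsed.dict()`.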


# PII Detection

In [60]:
import pprint

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

# Default engines; the analyzer loads the spaCy model downloaded earlier
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()



In [63]:
for d in dataset:
    sample = d["text"]
    # Detect PII spans, then replace each span with its entity-type placeholder
    results = analyzer.analyze(text=sample, language="en")
    anonymized = anonymizer.anonymize(text=sample, analyzer_results=results)
    anonymized_text = anonymized.text
    pprint.pprint(sample)
    pprint.pprint(anonymized_text)
    print("-" * 50)


('Ms. Jane Smith, a resident of Los Angeles, can be reached at her email '
 'address, janesmith@example.com. She can also be reached via her IP address, '
 '124.56.78.90. Additionally, her credit card number is 5273-8436-9291-5482 '
 'and her IBAN code is DE89370400440532013000.')
('Ms. <PERSON>, a resident of <LOCATION>, can be reached at her email address, '
 '<EMAIL_ADDRESS>. She can also be reached via her IP address, <IP_ADDRESS>. '
 'Additionally, her credit card number is <IN_PAN>9291-5482 and her IBAN code '
 'is <IBAN_CODE>.')
--------------------------------------------------
('Ms. Samantha Rodriguez, a resident of New York, can be reached at her email '
 'address srodriguez@gmail.com or contacted by phone at her French IBAN code '
 'FR76 3000 9000 0111. Her credit card number is 4532 7845 3928 1167. '
 'Additionally, her online activity can be traced back to her IP address '
 '203.145.22.77.')
('Ms. <PERSON>, a resident of <LOCATION>, can be reached at her email address '
 '

In [79]:
# Analyzer results for the last text processed in the loop above
results

[type: EMAIL_ADDRESS, start: 107, end: 136, score: 1.0,
 type: IP_ADDRESS, start: 299, end: 311, score: 0.95,
 type: PERSON, start: 4, end: 21, score: 0.85,
 type: LOCATION, start: 55, end: 61, score: 0.85,
 type: URL, start: 107, end: 116, score: 0.5,
 type: URL, start: 125, end: 136, score: 0.5,
 type: IN_PAN, start: 11, end: 21, score: 0.05,
 type: IN_PAN, start: 114, end: 124, score: 0.05]
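
The two `IN_PAN` hits above have a score of 0.05 and cover part of the person's name and part of the email address, so they look like false positives. One possible way to handle this is to drop low-scoring detections before anonymizing. The sketch below illustrates the idea on stand-in objects. `Detection` and `filter_by_score` are hypothetical names, the 0.4 cut-off is an arbitrary assumption, and the values are copied from the output above.

```python
from collections import namedtuple

# Stand-in for Presidio's result objects; only the attributes used here.
Detection = namedtuple("Detection", ["entity_type", "start", "end", "score"])


def filter_by_score(detections, min_score=0.4):
    """Keep only detections whose score is at least `min_score`."""
    return [d for d in detections if d.score >= min_score]


# Values copied from the analyzer output shown above.
detections = [
    Detection("EMAIL_ADDRESS", 107, 136, 1.0),
    Detection("IP_ADDRESS", 299, 311, 0.95),
    Detection("PERSON", 4, 21, 0.85),
    Detection("LOCATION", 55, 61, 0.85),
    Detection("URL", 107, 116, 0.5),
    Detection("URL", 125, 136, 0.5),
    Detection("IN_PAN", 11, 21, 0.05),
    Detection("IN_PAN", 114, 124, 0.05),
]

kept = filter_by_score(detections)
for d in kept:
    print(d.entity_type, d.start, d.end, d.score)
```

With the real objects, `filter_by_score(results)` could be passed as `analyzer_results` to `anonymizer.anonymize`. `analyzer.analyze` also accepts `score_threshold` and `entities` arguments, which may be a simpler way to get a similar effect.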