### industryGPT - Client MangoPay

This Python notebook illustrates the system built for the client MangoPay to classify prospect companies. It works by enriching each company with a small set of data points that can then be used to filter potential target accounts.

The data points enriched in this demo are:
- **Platform Classification**: Which of a predefined set of platform types the company belongs to.
- **Client Focus**: Which customer business model (B2C, B2B, C2C, or B2B/B2C) the company targets.

#### Import Libraries

In [105]:
from openai import OpenAI
from bs4 import BeautifulSoup
import json
import pandas as pd
from concurrent.futures import ThreadPoolExecutor, as_completed
from zenrows import ZenRowsClient
import os.path
import requests

#### OpenAI functions

In [106]:
client = OpenAI()

def generate_response(prompt):
    # GPT-4 Turbo preview in JSON mode; returns the raw JSON string
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": "You are a helpful assistant, expert in Analysing Companies."},
            {"role": "user", "content": str(prompt)},
        ],
        temperature=0,
    )
    selection = response.choices[0].message.content
    return selection

def generate_response_feedback(initial_question, system_answer, feedback):
    # Replays the earlier exchange and asks the model to correct its answer
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": "You are a helpful assistant, expert in Analysing Companies."},
            {"role": "user", "content": str(initial_question)},
            # The model's previous reply belongs to the "assistant" role, not "system"
            {"role": "assistant", "content": str(system_answer)},
            {"role": "user", "content": str(feedback)},
        ],
        temperature=0,
    )
    selection = response.choices[0].message.content
    return selection

def generate_response_gpt3(prompt):
    # GPT-3.5 Turbo, free-text output capped at 1000 tokens
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant, expert in Analysing Companies."},
            {"role": "user", "content": str(prompt)},
        ],
        max_tokens=1000,
        temperature=0,
    )
    selection = response.choices[0].message.content
    return selection

def generate_response_gpt3_json(prompt):
    # GPT-3.5 Turbo in JSON mode, output capped at 1000 tokens
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": "You are a helpful assistant, expert in Analysing Companies."},
            {"role": "user", "content": str(prompt)},
        ],
        max_tokens=1000,
        temperature=0,
    )
    selection = response.choices[0].message.content
    return selection
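
The helpers above differ only in the model name, whether JSON mode is on, and the token cap. As a minimal sketch, the shared part could be assembled by one parameterized builder. `build_chat_request` and `SYSTEM_PROMPT` are hypothetical names, not part of the notebook; the function only builds the keyword arguments, so it runs without an API key, and the result would be passed on as `client.chat.completions.create(**kwargs)`.

```python
from typing import Any, Dict, Optional

# Same system message as the notebook's helpers
SYSTEM_PROMPT = "You are a helpful assistant, expert in Analysing Companies."


def build_chat_request(
    prompt: str,
    model: str = "gpt-4-1106-preview",
    json_mode: bool = True,
    max_tokens: Optional[int] = None,
) -> Dict[str, Any]:
    """Assemble keyword arguments for a chat completion call (hypothetical helper)."""
    kwargs: Dict[str, Any] = {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": str(prompt)},
        ],
        "temperature": 0,
    }
    # Only request JSON mode when asked for
    if json_mode:
        kwargs["response_format"] = {"type": "json_object"}
    # Only cap the output length when a cap is given
    if max_tokens is not None:
        kwargs["max_tokens"] = max_tokens
    return kwargs


# Equivalent of generate_response_gpt3: free text, capped at 1000 tokens
gpt3_request = build_chat_request(
    "Describe this company", model="gpt-3.5-turbo", json_mode=False, max_tokens=1000
)
```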

#### Retrieve HTML

In [107]:
def retrieve_html(url):
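    # Read the ZenRows API key from the environment rather than hardcoding it
    # (ZENROWS_API_KEY is an assumed variable name)
    client = ZenRowsClient(os.environ["ZENROWS_API_KEY"])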

    params = {"block_resources": "image,media,font"}

    try:
        response = client.get(url, params=params)
    except Exception:
        # Retry once with premium proxies and CAPTCHA solving enabled
        params = {
            "block_resources": "image,media,font",
            "premium_proxy": "true",
            "resolve_captcha": "true",
        }
        response = client.get(url, params=params)

    soup = BeautifulSoup(response.text, "html.parser")

    # Drop script and style elements so only visible text remains
    for element in soup(["script", "style"]):
        element.extract()

    text = soup.get_text()
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = '\n'.join(chunk for chunk in chunks if chunk)

    return text
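
The last three lines of `retrieve_html` do the whitespace cleanup, and they can be tried on their own without any scraping. The sketch below isolates that step under a hypothetical name, `collapse_whitespace`; the sample string is made up for illustration.

```python
def collapse_whitespace(text: str) -> str:
    """Strip every line, split on double spaces, and drop empty fragments."""
    lines = (line.strip() for line in text.splitlines())
    # A run of spaces inside a line usually separates unrelated page elements
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    return "\n".join(chunk for chunk in chunks if chunk)


sample = "  Air cargo   booking  \n\n\n  Sign in  "
print(collapse_whitespace(sample))
# Air cargo
# booking
# Sign in
```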

#### Google search query

In [108]:
def selenium_search(query):
    # Replace these with your own API key and Custom Search Engine ID
    API_KEY = os.environ['GOOGLE_SEARCH_API_KEY']
    SEARCH_ENGINE_ID = os.environ['SEARCH_ENGINE_ID']

    # Base URL for Google Custom Search
    url = 'https://www.googleapis.com/customsearch/v1'

    # Parameters for the search
    params = {
        'key': API_KEY,
        'cx': SEARCH_ENGINE_ID,
        'q': query
    }

    # Send the GET request
    response = requests.get(url, params=params)

    # Check if the request was successful
    if response.status_code == 200:
        results = response.json()
        # Return the first search result
        if 'items' in results:
            return results['items'][0]['link']
        else:
            return 'No results found'
    else:
        return f'Error: {response.status_code}'
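
Note that `selenium_search` returns sentinel strings such as `'No results found'`, which a caller may then pass on as if they were URLs. One possible alternative, sketched here with the hypothetical helper `first_result_link`, is to return `None` when there is nothing usable so the caller can branch explicitly. The dictionaries below only mimic the `items`/`link` shape used in the function above.

```python
from typing import Any, Dict, Optional


def first_result_link(results: Dict[str, Any]) -> Optional[str]:
    """Return the first result's link, or None when the response has no items."""
    items = results.get("items") or []
    if not items:
        return None
    return items[0].get("link")


# Hand-written payloads shaped like the parsed JSON response
with_hits = {"items": [{"link": "https://www.cargo.one/"}, {"link": "https://example.com"}]}
without_hits = {"searchInformation": {"totalResults": "0"}}

print(first_result_link(with_hits))     # https://www.cargo.one/
print(first_result_link(without_hits))  # None
```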

#### Client focus prompt

In [109]:
# Client focus prompt
clients_focus = ["B2C", "B2B", "C2C", "B2B/B2C"]
prompt_client_focus = """

You are given the description of a company retrieved from its website.

You will need to categorize the company within a Client Focus according to the taxonomy below.

There can only be ONE client focus. The answer needs to be concise and can only follow the taxonomy.

DO NOT come up with any taxonomy. STICK to the taxonomy below. 

Give your results in JSON format - 

'client_focus': [client focus] 


Here is the taxonomy for the client focus:
B2C
B2B
C2C
B2B/B2C

Here is a description of each category:
B2C : Business to Consumer Model

B2B : Business to Business Model

C2C : Customer to Customer

B2B/B2C : Both Business to Business and Business to Consumer

"""
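
Because the model is only asked, not forced, to stay within the taxonomy, the reply is worth validating before use. The sketch below is one way to do that; `parse_client_focus` and `CLIENT_FOCUS_TAXONOMY` are hypothetical names, and tolerating a one-element list is an assumption based on the bracketed format hint in the prompt.

```python
import json
from typing import Optional

CLIENT_FOCUS_TAXONOMY = ["B2C", "B2B", "C2C", "B2B/B2C"]


def parse_client_focus(raw_reply: str) -> Optional[str]:
    """Return the client focus if the reply is valid JSON inside the taxonomy, else None."""
    try:
        payload = json.loads(raw_reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    value = payload.get("client_focus")
    # The prompt shows the value in brackets, so accept a one-element list too
    if isinstance(value, list) and len(value) == 1:
        value = value[0]
    return value if value in CLIENT_FOCUS_TAXONOMY else None


print(parse_client_focus('{"client_focus": "B2B"}'))    # B2B
print(parse_client_focus('{"client_focus": ["C2C"]}'))  # C2C
print(parse_client_focus('{"client_focus": "B2G"}'))    # None
```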

#### Platform Prompt

In [110]:
platform_tags = ["On-demand Platform", "Financial Platforms", "Product Marketplace", "Not Classified"]

prompt_marketplace = """
You are given the description of a company retrieved from a company's website.

You will need to categorize the company within a platform category according to the taxonomy below.

"On-demand Platform" or "Product Marketplace" or "Financial Platforms" or "Not Classified"

- Product Marketplace: A virtual commerce space where multiple vendors list and sell tangible goods to consumers or businesses. This can include marketplaces with a broad variety of products or niche sites specializing in specific categories, such as crafts or luxury watches. 
Examples include Amazon (general marketplace), Vinted (second-hand clothes), Etsy (handmade and vintage items), Chrono24 (luxury watches), Farfetch (fashion), Decathlon (sports equipment), Vestas (wind turbines), and Houzz (home remodeling and interior design).

- Financial Platforms: Digital systems supporting monetary services, including traditional banking, investment, lending, crowdfunding, insurtech, and innovative financial solutions. 
Any application that involves financial operations as their main point of focus should be classified here.
These platforms may offer personal and commercial banking products, investment opportunities, insurance policies, and technologies that facilitate financial transactions. 
Examples encompass Seedrs (equity crowdfunding), Ulule (crowdfunding), Homunity.com (real estate crowdfunding), Klarna (buy now, pay later services), Allianz (insurance), Lloyds Bank (banking), Alma (payment solutions), and Revolut (digital banking).

- On-demand Platform: Interactive, often mobile-accessible applications or websites that connect users with providers delivering immediate services across various sectors such as transportation, manufacturing, food delivery, home services, real estate and wellness. 
These platforms usually facilitate transactions and scheduling between the customer and service provider in real time. The service provider can be the company itself too. Examples include Deliveroo (food delivery), Uber (ride-hailing and food delivery), TaskRabbit (handyman and errand services), BlaBlaCar (carpooling), Treatwell (beauty and wellness bookings), and Cargo.one (air cargo services).

- Not Classified: Entities that do not align with the aforementioned categories. This includes organizations whose core operations do not revolve around selling products, providing financial services, or connecting service providers with consumers in an on-demand fashion. 

There can only be ONE category. The answer needs to be concise and can only follow the taxonomy.

DO NOT come up with any taxonomy. STICK to the taxonomy above. 

Give your results in JSON format - 

'platform_tag': "On-demand Platform" or "Financial Platforms" or "Product Marketplace" or "Not Classified"

"""
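
A model reply may differ from the entries in `platform_tags` only in casing or a trailing plural "s" (for example "On-Demand Platforms"), and an exact string comparison would reject it. The sketch below maps such near-variants back to the canonical tag. `canonical_platform_tag` and the normalization rule are assumptions for illustration, not something the notebook does.

```python
from typing import Optional

PLATFORM_TAGS = ["On-demand Platform", "Financial Platforms", "Product Marketplace", "Not Classified"]


def _key(label: str) -> str:
    """Lowercase, trim, and drop one trailing 's' so near-variants compare equal."""
    key = label.strip().lower()
    return key[:-1] if key.endswith("s") else key


# Lookup from normalized key to the canonical spelling
_CANONICAL = {_key(tag): tag for tag in PLATFORM_TAGS}


def canonical_platform_tag(label: str) -> Optional[str]:
    """Return the canonical tag for a label, or None if it is outside the taxonomy."""
    return _CANONICAL.get(_key(label))


print(canonical_platform_tag("On-Demand Platforms"))  # On-demand Platform
print(canonical_platform_tag("financial platform"))   # Financial Platforms
print(canonical_platform_tag("SaaS"))                 # None
```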

In [111]:
# Company description prompt
organize_prompt = '''From the following scraped text of a website, explain what the business does. The text will be poorly written, so keep that in mind. 
Write everything in the third person, naming the company. Output only a description using key words of the industry. Focus on understanding what their business model is and how they operate. 

Here is the scraped code/text: '''

header_url= "Here is the description of the company deduced from their website: "


In [112]:
# Helper functions
def truncate_string(input_string):
    # Slicing is a no-op for strings of 3000 characters or fewer
    return input_string[:3000]

def format_url(url):
    if url.startswith("http://www."):
        url = url.replace("http://www.", "https://www.")
    elif url.startswith("http://"):
        url = url.replace("http://", "https://www.")
    elif url.startswith("www."):
        url = "https://" + url
    elif not url.startswith("https://"):
        url = "https://www." + url
    return url
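
`format_url` always ends up on an `https://www.` host when the input has no `www.` prefix, which may not resolve for sites served only from the bare domain. As an alternative sketch, the standard library's `urllib.parse` can upgrade the scheme while leaving the host untouched. `normalize_url` is a hypothetical name, and whether keeping the bare host is preferable for this dataset is an assumption.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Force the https scheme and keep host, path, query, and fragment as given."""
    url = url.strip()
    # Without a scheme, urlsplit would treat the host as part of the path
    if "://" not in url:
        url = "https://" + url
    parts = urlsplit(url)
    return urlunsplit(("https", parts.netloc, parts.path, parts.query, parts.fragment))


print(normalize_url("http://ontruck.com"))     # https://ontruck.com
print(normalize_url("cargo.one"))              # https://cargo.one
print(normalize_url("https://merxu.com/pl/"))  # https://merxu.com/pl/
```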

In [113]:
# IndustryGPT 
exec = ThreadPoolExecutor(12)

def industryGPT(name, url, index):
    print('Enriching: ', name)
    print('With URL: ', url)

    full_response = {
            "index": index,
            "company_profile": {
                "name": str(name),
                "website": str(url),
                "client_focus": None,
                "platform_tag": None,
                "description": None
            }

        }
    
    try:
        url_text = retrieve_html(format_url(url))
        print('\n-> Retrieved text from website...')
    except Exception as e:
        print('Inaccessible URL as per error ->', e)
        print('Searching from Google.')
        new_url = selenium_search(format_url(url))
        url_text = retrieve_html(format_url(new_url))

    if len(url_text) > 30:
        print('-> Crafting company description...')
        description_openai = generate_response_gpt3(organize_prompt + truncate_string(url_text))
        print('\nDescription of the company: ', description_openai)

    else:
        print('Scraped website text has 30 characters or fewer...')
        print('Trying to find a new page through Google search...')
        new_url = selenium_search(format_url(url))
        url_text = retrieve_html(new_url)
        description_openai = generate_response_gpt3(organize_prompt + truncate_string(url_text))
        print('\nDescription of the company: ', description_openai)
    
    # Save description
    full_response["company_profile"]["description"] = description_openai

    # Categorise platform type
    response_marketplace = exec.submit(generate_response, (prompt_marketplace +
                                header_url +
                                description_openai))
    
    # Categorise client Focus
    response_clientfocus = exec.submit(generate_response, (prompt_client_focus +
                                header_url +
                                description_openai))
    
    result_clientfocus = response_clientfocus.result()
    result_marketplace = response_marketplace.result()

    # Validate the client focus against the taxonomy, retrying up to twice
    response_dict_clientfocus = json.loads(result_clientfocus)
    client_focus = response_dict_clientfocus.get("client_focus")

    tries_client_focus = 0
    while client_focus not in clients_focus and tries_client_focus < 2:
        print('Client Focus not in category.')
        print('Faulty Client Focus: ', client_focus)
        initial_question = (prompt_client_focus + header_url + description_openai)
        # str() guards against non-string answers such as lists or None
        system_answer = "Client Focus: " + str(client_focus)
        feedback = "The client focus is not within taxonomy... retry please and stay within the provided taxonomy."
        # Resubmit the prompt
        retry_clientfocus = exec.submit(generate_response_feedback, initial_question, system_answer, feedback)
        retry_clientfocus_dict_json = json.loads(retry_clientfocus.result())
        client_focus = retry_clientfocus_dict_json.get("client_focus")
        tries_client_focus += 1

    # Still outside the taxonomy after both retries: store None instead
    if client_focus not in clients_focus:
        print('GPT failed twice, returning None in Client Focus.')
        client_focus = None
    
    # Save result client_focus
    full_response["company_profile"]["client_focus"] = client_focus  
    

    # Validate the platform tag against the taxonomy, retrying up to twice
    response_dict_marketplace = json.loads(result_marketplace)
    platform_tag = response_dict_marketplace.get("platform_tag")
    tries_marketplace = 0

    while platform_tag not in platform_tags and tries_marketplace < 2:
        print('Platform tag not in category.')
        print('Faulty Platform tag: ', platform_tag)
        initial_question = (prompt_marketplace + header_url + description_openai)
        # str() guards against non-string answers such as lists or None
        system_answer = "Platform tag: " + str(platform_tag)
        feedback = "The categorisation is not within taxonomy... retry please and stay within the provided taxonomy."
        # Resubmit the prompt
        retry_platform = exec.submit(generate_response_feedback, initial_question, system_answer, feedback)
        retry_platform_dict_json = json.loads(retry_platform.result())
        platform_tag = retry_platform_dict_json.get("platform_tag")
        tries_marketplace += 1

    # Still outside the taxonomy after both retries: store None instead
    if platform_tag not in platform_tags:
        print('GPT failed twice, returning None in Platform tag.')
        platform_tag = None

    full_response["company_profile"]["platform_tag"] = platform_tag
    full_response_json = json.dumps(full_response)
    json_string_pretty = json.dumps(full_response, indent=2)
    print('')
    print(json_string_pretty)
    print('--------------------------------')

    return full_response_json
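
The two validation loops in `industryGPT` follow the same pattern: ask once, check the answer against a list of allowed values, and re-ask with feedback a limited number of times. A minimal sketch of that pattern as one reusable function is shown below. `classify_with_retries` and its callable parameters are hypothetical; the lambdas stand in for the model calls so the example runs offline.

```python
from typing import Callable, Optional, Sequence


def classify_with_retries(
    ask: Callable[[], Optional[str]],
    retry: Callable[[Optional[str]], Optional[str]],
    allowed: Sequence[str],
    max_retries: int = 2,
) -> Optional[str]:
    """Return the first answer inside `allowed`, or None once retries run out."""
    answer = ask()
    for _ in range(max_retries):
        if answer in allowed:
            return answer
        # Hand the faulty answer back so the retry can include it as feedback
        answer = retry(answer)
    return answer if answer in allowed else None


# Stubbed model: first answer is invalid, first retry is invalid, second retry is valid
retry_answers = iter(["B2B2C", "B2B"])
result = classify_with_retries(
    ask=lambda: "Retail",
    retry=lambda faulty: next(retry_answers),
    allowed=["B2C", "B2B", "C2C", "B2B/B2C"],
)
print(result)  # B2B
```

Keeping the retry budget and the final "give up and return `None`" rule in one place avoids the two loops drifting apart.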


#### Demo for MangoPay

In [114]:
# industryGPT('Mano Mano', 'cargo.one', 0)

#### CSV enrichment

In [115]:
df = pd.read_csv('MangoPay - Enrichment - Sheet1.csv')
df.columns
df.dropna(how='all', inplace=True)

In [116]:
df

| Unnamed: 0 | id | company name | website | domain | R-website_url | S-Country |
|---|---|---|---|---|---|---|
| 0 | 9 | Cargo One | https://www.cargo.one/ | cargo.one | http://www.cargo.one | Germany |
| 1 | 12 | Laserhub | https://laserhub.com/ | laserhub.com | https://laserhub.com/ | Germany |
| 2 | 14 | Merxu | https://merxu.com/pl/ | merxu.com | https://merxu.com/ | Poland |
| 3 | 15 | Metalshub | https://www.metals-hub.com/ | metals-hub.com | https://www.metals-hub.com | Germany |
| 4 | 17 | Ontruck | https://www.ontruck.com/en | ontruck.com | http://ontruck.com | Spain |
| 5 | 18 | Rooser | https://www.rooser.eu/ | rooser.eu | https://rooser.eu/ | United Kingdom |
| 6 | 21 | Storefront | https://www.thestorefront.com/ | thestorefront.com | https://www.thestorefront.com/ | France |
| 7 | 24 | Container xChange | https://www.container-xchange.com/ | container-xchange.com | https://www.container-xchange.com/ | Germany |
| 8 | 33 | Ankorstore | https://www.ankorstore.com/ | ankorstore.com | https://www.ankorstore.com/ | France |
| 9 | 36 | Twine | https://www.twine.net/ | twine.net | http://twine.net | United Kingdom |


In [117]:
def process_row(name, website, index):
    try:
        # Call the API and capture the response
        response = industryGPT(name, website, index)
        response_json = json.loads(response)


        # Extract the required data from response
        platform_tag = response_json['company_profile']['platform_tag']
        client_focus = response_json['company_profile']['client_focus']
        description = response_json['company_profile']['description']

        return index, platform_tag, client_focus, description
    except Exception as e:
        print(f"Error processing row {index}: {e}")
        return index, None, None, None


df['Platform Tag'] = None
df['Client Focus'] = None
df['Description'] = None

executor = ThreadPoolExecutor(5)
# List to store futures
futures = []
for index, row in df.iterrows():
    name = row['company name']
    website = row['website'] 
    futures.append(executor.submit(process_row, name, website, index))

# Retrieve results and update DataFrame
for future in as_completed(futures):
    index, platform_tag, client_focus, description = future.result()
    df.at[index, 'Platform Tag'] = platform_tag
    df.at[index, 'Client Focus'] = client_focus
    df.at[index, 'Description'] = description
    df.to_csv('Enriched-MangoPay.csv')

# Shutdown the executor
executor.shutdown()

Enriching:  Cargo One
With URL:  https://www.cargo.one/
Enriching:  Laserhub
With URL:  https://laserhub.com/
Enriching:  Merxu
With URL:  https://merxu.com/pl/
Enriching:  Metalshub
With URL:  https://www.metals-hub.com/
Enriching:  Ontruck
With URL:  https://www.ontruck.com/en

-> Retrieved text from website...
-> Crafting company description...

-> Retrieved text from website...
-> Crafting company description...

-> Retrieved text from website...
-> Crafting company description...

-> Retrieved text from website...
-> Crafting company description...

-> Retrieved text from website...
-> Crafting company description...

Description of the company:  Ontruck is a company that operates in the freight shipping and logistics industry. They provide transport services for pallets, packages, and parcels, ensuring that goods arrive safely and on time. They have developed AI-powered software that connects to management systems to solve road freight problems. This software includes an AI prici

In [119]:
df.to_csv('MangoPay_Enriched.csv')
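
The enrichment cell fans rows out to a thread pool and collects results in completion order, which is why the log lines above are interleaved. The sketch below reproduces that pattern in isolation, using a `with` block so the pool shuts down even if something raises. `enrich_stub` is a hypothetical stand-in for `process_row` that makes no network calls, and the plain dictionaries replace the DataFrame.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def enrich_stub(name, index):
    """Stand-in for process_row: fails on an empty name, otherwise tags the name."""
    if not name:
        raise ValueError("missing company name")
    return index, f"{name} (enriched)"


rows = {0: "Cargo One", 1: "", 2: "Laserhub"}
results = {}

with ThreadPoolExecutor(max_workers=5) as pool:
    # Map each future back to its row index so failures can be attributed
    futures = {pool.submit(enrich_stub, name, idx): idx for idx, name in rows.items()}
    for future in as_completed(futures):
        idx = futures[future]
        try:
            _, value = future.result()
            results[idx] = value
        except Exception as exc:
            print(f"Error processing row {idx}: {exc}")
            results[idx] = None

# Completion order varies between runs, so print in index order
for idx in sorted(results):
    print(idx, results[idx])
```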