### industryGPT - Client Foils

This Python notebook illustrates the system used for the client Foils to classify prospect companies. It enriches each company with a basic set of data points that can then be used to filter potential target accounts.

The following data points are enriched in this demo:
- **Industry**: Industry in which the company operates.
- **Business Model**: Business model under which the company operates.
- **Client Focus**: Subset of customer business models that the company targets (B2C, B2B, or both).
- **End Buyer**: Subset of departments the company targets.
- **Innovation Challenges**: Sector and innovation challenges the company is likely to face.

In [198]:
from openai import OpenAI
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
import time
import os
from googlesearch import search
from dotenv import load_dotenv
from datetime import datetime
import json
import pandas as pd


#### OpenAI functions

In [199]:
load_dotenv()
client = OpenAI()

def generate_response(prompt):
    # GPT-4 Turbo call in JSON mode; the prompt itself must mention JSON.
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": "You are a helpful assistant, expert in Analysing Companies."},
            {"role": "user", "content": str(prompt)},
        ],
        temperature=0,
    )
    return response.choices[0].message.content

def generate_response_gpt3(prompt):
    # Cheaper free-text call, used to summarise the scraped website text.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant, expert in Analysing Companies."},
            {"role": "user", "content": str(prompt)},
        ],
        max_tokens=300,
        temperature=0,
    )
    return response.choices[0].message.content

#### Scraping function

In [200]:
def retrieve_html(url):
    headers = {'User-Agent': 'Mozilla/5.0'}
    req = Request(url, headers=headers)
    html = urlopen(req).read()
    soup = BeautifulSoup(html, features="html.parser")

    # Drop script and style elements so only visible text remains
    for tag in soup(["script", "style"]):
        tag.extract()

    text = soup.get_text()
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return text
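
The whitespace clean-up at the end of `retrieve_html` can be tried on its own, without fetching a page. The sketch below applies the same three steps to a made-up string; `clean_text` is a hypothetical name used only for this illustration.

```python
def clean_text(text):
    """Collapse scraped text to one non-empty phrase per line."""
    lines = (line.strip() for line in text.splitlines())
    # Two consecutive spaces are treated as a separator between phrases.
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    return '\n'.join(chunk for chunk in chunks if chunk)

# Made-up sample standing in for the output of soup.get_text()
sample = "  Our products  \n\n   Aluminium plates    Rolling ingots \n\t\n Contact us "
print(clean_text(sample))
```

Blank lines and padding disappear, and phrases separated by runs of spaces end up on separate lines.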

In [201]:
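def selenium_search(query):
    # Despite the name, this relies on the googlesearch package, not Selenium.
    # Returns the first result URL, or None when the search yields nothing.
    for url in search(query + ' website', lang='en'):
        return url
    return None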

#### Client Focus Prompt

In [202]:
client_focus = ["B2C", "B2B", "B2B/B2C"]
prompt_client_focus = """

You are given the description of a company retrieved from a company's website.

You will need to categorize the company within a Client Focus according to the taxonomy below.

There can only be ONE client focus. The answer must be concise and follow only the taxonomy.

DO NOT come up with any taxonomy. STICK to the taxonomy below.

Give your results in JSON format -

'client_focus': [client focus]


Here is the taxonomy for the client focus:
B2C
B2B
B2B/B2C
"""

#### Industry Prompt

In [203]:
startup_industries = ["Building Materials", "Business Intelligence", "Cannabis", "Civil Engineering Construction", "Cloud Infrastructure", "Collaboration", "Commercial Recycling & Waste Management", "Contract Management", "Corporate Social Responsibility (CSR) Solutions", "Customer Relationship Management (CRM)", "Database & File Management", "Digital Transformation Consulting", "E-Commerce", "Electronics", "Emergency & Law Enforcement", "Employee Compensation", "Energy, Utilities & Waste", "Engineering Software", "Enterprise Resource Planning (ERP)", "Event Management", "Events Services", "Facilities Management & Commercial Cleaning", "Financial Services", "Food & Beverage", "Freight & Logistics", "Gaming", "Health, Beauty & Fitness", "Healthcare", "Hospitality Management Systems", "Human Resources (HR)", "Human Resources Consulting", "Innovation Management Consulting", "Internet Publishing & Classifieds", "IT & Infrastructure", "Leasing Non-Residential Real Estate", "Legal", "Managed IT Services", "Management Consulting", "Market Intelligence Platform", "Market Research", "Marketing & Advertising", "Media & Entertainment", "Medical Devices & Equipment", "Miscellaneous Manufacturing", "Multimedia & Graphic Design", "Oil & Gas Exploration", "On Demand Delivery", "On Demand Services", "On Demand Transportation", "Payments Infrastructure & Solutions Provider", "Payroll & Benefits", "Procurement Solutions", "Productivity & Automation", "Real Estate", "Research & Development", "Robotics", "Sales & Lead Generation", "Security & Compliance", "Security Products & Services", "Staffing & Recruiting", "Storage Solutions Provider", "Supply Chain Management (SCM)", "Telephony & Wireless", "Textiles & Apparel", "Training & Education", "Translation & Linguistic Services", "Travel & Expense Management", "Video Games Distribution", "Video Games Production"]

prompt_industry_business = """

You are given the description of a company retrieved from a company's website.

You will need to categorize the company within their Industry and Business Model Category according to the taxonomy below.

There can only be ONE industry and ONE business model. The answer must be concise and follow only the taxonomy.

DO NOT come up with any taxonomy. STICK to the taxonomy below.

Give your results in JSON format -

'industry': [industry]
'business_model': [business model]

Here is the taxonomy for each of the Industries and Business Models -

Business Models:
SaaS
Marketplace
Ecommerce
Service
Manufacturing 

Industries:

"""



#### End Buyer Prompt

In [204]:
end_buyer = ["Sales", "Marketing", "HR", "Finance", "Tech & Data", "Legal", "Procurement", "Client support", "CSE", "ESG", "Communication", "Consumer"]
prompt_end_buyer = """

You are given the description of a company retrieved from a company's website.

You will need to categorize the company within an End Buyer category according to the taxonomy below.

There can only be ONE end buyer. The answer must be concise and follow only the taxonomy.

DO NOT come up with any taxonomy. STICK to the taxonomy below.

Give your results in JSON format -

'end_buyer': [end buyer]

Here is the taxonomy for the End Buyer -

End Buyer:
Sales
Marketing
HR
Finance
Tech & Data
Legal
Procurement
Client support
CSE
ESG
Communication
Consumer
"""

#### Sector and Innovation challenges prompt


In [205]:
challenges_prompt = ''' 

You are given the description of a company retrieved from a company's website.

You will need to explain in very few words the sector & innovation challenges that they might be facing.

Give your results in JSON format -

'sector_challenge': [challenges of the sector] 
'innovation_challenge': [challenges of innovation]
'''

In [206]:
# The leading blank line keeps this header from running into the end of the preceding prompt text.
header_url = "\n\nHere is their description, deduced from their website: "
organize_prompt = 'Rewrite the following text scraped from a company main website. Explain what the business does. The text will be poorly written, keep that in mind. Make the description short and concise. Here is the scraped text: '

In [207]:
def truncate_string(input_string):
    # Keep at most the first 3000 characters; shorter strings are returned unchanged.
    return input_string[:3000]

def format_url(url):
    if url.startswith("http://www."):
        url = url.replace("http://www.", "https://www.")
    elif url.startswith("http://"):
        url = url.replace("http://", "https://www.")
    elif url.startswith("www."):
        url = "https://" + url
    elif not url.startswith("https://"):
        url = "https://www." + url
    return url
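
`format_url` prepends `www.` whenever the host lacks it, which may not resolve for sites served only from a bare domain or a subdomain. An alternative that only upgrades the scheme and leaves the host untouched is sketched below. `normalize_url` is a hypothetical name and is not used elsewhere in this notebook.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    """Force an https scheme without altering the host (hypothetical helper)."""
    url = url.strip()
    if "://" not in url:
        url = "https://" + url  # bare domains such as "example.com"
    parts = urlsplit(url)
    return urlunsplit(("https", parts.netloc, parts.path, parts.query, parts.fragment))

print(normalize_url("http://example.com/about"))
print(normalize_url("www.example.com"))
```

Whether to keep or drop the `www.` prefix depends on the sites being enriched, so this is a design choice and not a correction.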

In [208]:
def parse_response(input_string):
    # Parse "Key: value" lines. Split on the first colon only, so values may
    # contain ':' and lines without a colon (such as blank lines) are skipped.
    pairs = [line.split(':', 1) for line in input_string.split('\n') if ':' in line]
    data = {key.strip(): value.strip() for key, value in pairs}
    industry = data.get('Industry')
    business_model = data.get('Business Model')

    return industry, business_model

In [209]:
def business_status(company_employees, company_founded_date):
    # Age in years, relative to the current year rather than a hard-coded one.
    company_age = datetime.now().year - company_founded_date

    # Note: exactly 200 employees matches neither of the first two branches.
    if company_employees < 200:
        return 'Startup'
    elif (201 <= company_employees <= 1000 and 3 <= company_age < 15) or 501 <= company_employees <= 1000:
        return 'MidMarket'
    elif company_employees >= 1001 or company_age >= 15:
        return 'Corporate'
    else:
        return 'Uncategorized'
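
The thresholds are easier to check when the reference year is passed as an argument. The sketch below does that with the same rules. `business_status_at` and its `current_year` parameter are hypothetical names, and the sample companies are made up.

```python
def business_status_at(company_employees, company_founded_date, current_year):
    """Same thresholds as business_status, with the reference year as an argument."""
    company_age = current_year - company_founded_date
    if company_employees < 200:
        return 'Startup'
    if (201 <= company_employees <= 1000 and 3 <= company_age < 15) or 501 <= company_employees <= 1000:
        return 'MidMarket'
    if company_employees >= 1001 or company_age >= 15:
        return 'Corporate'
    return 'Uncategorized'

# (employees, founded year) pairs invented for illustration
samples = [(50, 2021), (300, 2015), (300, 2000), (5000, 2010), (300, 2023)]
for employees, founded in samples:
    print(employees, founded, business_status_at(employees, founded, current_year=2024))
```

With these rules, a company of 201 to 500 employees that is under 3 years old falls through to 'Uncategorized'.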

In [210]:
def industryGPT(name, url):
    print('Enriching: ', name)
    print('With URL: ', url)

    full_response = {
            "metadata": {
                "timestamp": str(datetime.now()),
                "source": "industryGPT"
            },
            "company_profile": {
                "website": None,
                "business_status": None,
                "industry": None,
                "business_model": None,
                "end_buyer": None,
                "client_focus": None,
                "description": None,
                "sector_challenge": None,
                "innovation_challenge": None
            }

        }
    
    # Retrieve the website text and build a company description
    url_text = retrieve_html(format_url(url))
    print('\n-> Retrieved text from website...')
    
    if len(url_text) > 100:
        print('-> Crafting company description...')
        description_openai = generate_response_gpt3(organize_prompt + truncate_string(url_text))
        print('\nDescription of the company: ', description_openai)

    else:
        print('Scraped website has fewer than 100 characters...')
        print('Searching Google for another page...')
        new_url = selenium_search(format_url(url))
        url_text = retrieve_html(new_url)
        description_openai = generate_response_gpt3(organize_prompt + truncate_string(url_text))
        print('\nDescription of the company: ', description_openai)

    # Save description
    full_response["company_profile"]["description"] = description_openai

    industry_categories = startup_industries
    selected_industries = ', '.join(industry_categories)

    # Categorise industry & business Model
    response_industry = generate_response(prompt_industry_business +
                                selected_industries +
                                header_url +
                                description_openai)
    
    response_dict_industry = json.loads(response_industry)
    full_response["company_profile"]["industry"] = response_dict_industry["industry"]
    full_response["company_profile"]["business_model"] = response_dict_industry["business_model"]

    # Categorise client focus
    response_clientfocus = generate_response(prompt_client_focus +
                                header_url +
                                description_openai)
    
    response_dict_clientfocus = json.loads(response_clientfocus)
    full_response["company_profile"]["client_focus"] = response_dict_clientfocus["client_focus"]

    # Categorise end buyer
    response_endbuyer = generate_response(prompt_end_buyer +
                                header_url +
                                description_openai)
    
    response_dict_endbuyer = json.loads(response_endbuyer)
    full_response["company_profile"]["end_buyer"] = response_dict_endbuyer["end_buyer"]

    # Categorise challenges of the sector and of innovation
    response_challenges = generate_response(challenges_prompt +
                                header_url +
                                description_openai)
    
    response_dict_challenges = json.loads(response_challenges)
    full_response["company_profile"]["sector_challenge"] = response_dict_challenges["sector_challenge"]
    full_response["company_profile"]["innovation_challenge"] = response_dict_challenges["innovation_challenge"]
    

    full_response_json = json.dumps(full_response)

    json_string_pretty = json.dumps(full_response, indent=2)
    print('')
    print(json_string_pretty)

    return full_response_json
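
`industryGPT` repeats the same three steps for every category: build a prompt, parse the JSON reply, and copy the keys into the profile. The sketch below factors that pattern into a loop and stores `None` when a key is missing or the reply is not valid JSON. `classify` and `fake_generate` are hypothetical names, and the stub replies are made up so the example runs without an API call.

```python
import json

def classify(description, steps, generate):
    """Run each (prompt, keys) step and collect the requested keys.

    `generate` is any callable that takes a prompt string and returns a JSON
    string, such as generate_response in this notebook. Missing keys and
    unparseable replies are stored as None instead of raising.
    """
    profile = {}
    for prompt, keys in steps:
        try:
            reply = json.loads(generate(prompt + description))
        except json.JSONDecodeError:
            reply = {}
        if not isinstance(reply, dict):
            reply = {}
        for key in keys:
            profile[key] = reply.get(key)
    return profile

# Stand-in for the model call so the sketch runs offline (made-up replies).
def fake_generate(prompt):
    if "industry" in prompt:
        return '{"industry": "Real Estate", "business_model": "SaaS"}'
    return "not json"

steps = [
    ("Classify industry and business model: ", ["industry", "business_model"]),
    ("Classify client focus: ", ["client_focus"]),
]
print(classify("An ESG platform for property portfolios.", steps, fake_generate))
```

In the notebook, the steps would pair each prompt defined above with the keys it asks for, and `generate` would be `generate_response`.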

#### Demo for Foils 

In [217]:
# Inputs: company name and website URL

industryGPT('Adopt Parfums', 'https://www.aluminiumdunkerque.fr/')

Enriching:  Adopt Parfums
With URL:  https://www.aluminiumdunkerque.fr/

-> Retrieved text from website...
-> Crafting company description...

Description of the company:  Aluminium Dunkerque is a company that specializes in the fabrication of aluminum plates and rolling ingots. They are committed to accelerating the decarbonization of their aluminum production. The company is involved in various projects, including the construction of an eighth furnace dedicated to recycling and the conversion of electrolysis tanks for low-energy consumption. They also prioritize safety, environmental sustainability, and community engagement.

{
  "metadata": {
    "timestamp": "2023-12-05 15:56:34.200447",
    "source": "industryGPT"
  },
  "company_profile": {
    "website": null,
    "business_status": null,
    "industry": "Miscellaneous Manufacturing",
    "business_model": "Manufacturing",
    "end_buyer": "Procurement",
    "client_focus": "B2B",
    "description": "Aluminium Dunkerque is a co

'{"metadata": {"timestamp": "2023-12-05 15:56:34.200447", "source": "industryGPT"}, "company_profile": {"website": null, "business_status": null, "industry": "Miscellaneous Manufacturing", "business_model": "Manufacturing", "end_buyer": "Procurement", "client_focus": "B2B", "description": "Aluminium Dunkerque is a company that specializes in the fabrication of aluminum plates and rolling ingots. They are committed to accelerating the decarbonization of their aluminum production. The company is involved in various projects, including the construction of an eighth furnace dedicated to recycling and the conversion of electrolysis tanks for low-energy consumption. They also prioritize safety, environmental sustainability, and community engagement.", "sector_challenge": ["High energy consumption and costs", "Greenhouse gas emissions and carbon footprint", "Competition from low-cost producers", "Fluctuating raw material prices", "Meeting strict environmental regulations", "Supply chain susta

Enriching companies from a test .csv file

In [212]:
account = pd.read_csv('/Users/ismadoukkali/Desktop/industryGPT/industryGPT/foils/Feuille enrichissement.csv')
account

| Unnamed: 0 | Entreprise | URL | Enjeu d'innovation | Enjeux du secteur |
|---|---|---|---|---|
| 0 | JJA | https://www.jja-sa.fr/ | | |
| 1 | Laboratoires Vivacy | https://vivacy.com/fr/ | | |
| 2 | Legrand France | https://www.legrand.fr/?gad_source=1&gclid=Cjw... | | |
| 3 | Lucca | https://www.lucca.fr/ | | |
| 4 | Maison Francis Kurkdjian | https://www.franciskurkdjian.com/eu-fr?gad_sou... | | |
| 5 | Mathematic Studio | https://mathematic.tv/ | | |
| 6 | McPhy | https://mcphy.com/fr/ | | |
| 7 | Imerys | https://www.imerys.com/fr | | |
| 8 | MONTEIRO | https://www.monteiro-fr.com/qui-sommes-nous.html | | |
| 9 | NovAliX | https://novalix.com/ | | |
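
A batch over many websites will meet pages that fail to load or replies that fail to parse, and a single exception stops a plain loop. One way to keep going is sketched below over a list of row dictionaries, such as the result of `df.to_dict('records')`. `enrich_rows` and `fake_enrich` are hypothetical names, and the stub stands in for `industryGPT` so the example runs offline.

```python
def enrich_rows(rows, enrich, start=0):
    """Apply `enrich(name, url)` to each row from index `start` onward.

    A failure on one row is recorded in an 'Error' field and the loop continues.
    """
    for index, row in enumerate(rows):
        if index < start:
            continue  # already processed in an earlier run
        try:
            row.update(enrich(row['Entreprise'], row['URL']))
            row['Error'] = None
        except Exception as exc:  # broad on purpose: network, JSON, and key errors
            row['Error'] = f"{type(exc).__name__}: {exc}"
    return rows

# Stand-in for industryGPT (made-up behaviour): fails when the URL is empty.
def fake_enrich(name, url):
    if not url:
        raise ValueError("missing URL")
    return {'Industry': 'placeholder'}

rows = [
    {'Entreprise': 'Lucca', 'URL': 'https://www.lucca.fr/'},
    {'Entreprise': 'Example Co', 'URL': ''},
]
for row in enrich_rows(rows, fake_enrich):
    print(row['Entreprise'], row.get('Industry'), row['Error'])
```

Rows with a non-empty 'Error' value can then be retried or reviewed by hand.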


In [215]:
df = account
df['Industry'] = None
df['Business Model'] = None
df['End Buyer'] = None
df['Client Focus'] = None
df['Innovation Challenges'] = None
df['Sector Challenges'] = None

# Resume from the previous checkpoint file; this replaces the frame prepared above.
df = pd.read_csv('/Users/ismadoukkali/Desktop/industryGPT/industryGPT/foils/enriched_foils_1.csv')

for index, row in df.iterrows():
    # Skip the first 40 rows (presumably enriched already in enriched_foils_1.csv).
    if index >= 40:
        name = row['Entreprise']
        website = row['URL']
        results = industryGPT(name, website)
        response_json = json.loads(results)
        df.at[index, 'Industry'] = response_json["company_profile"]["industry"]
        df.at[index, 'Business Model'] = response_json["company_profile"]["business_model"]
        df.at[index, 'End Buyer'] = response_json["company_profile"]["end_buyer"]
        df.at[index, 'Client Focus'] = response_json["company_profile"]["client_focus"]
        df.at[index, 'Innovation Challenges'] = response_json["company_profile"]["innovation_challenge"]
        df.at[index, 'Sector Challenges'] = response_json["company_profile"]["sector_challenge"]
        # Save after every row so progress survives an interruption.
        df.to_csv('enriched_foils_2.csv', index=False)

Enriching:  Deepki
With URL:  https://www.deepki.com/fr/

-> Retrieved text from website...
-> Crafting company description...

Description of the company:  Deepki is a global leader in ESG (Environmental, Social, and Governance) solutions for the real estate industry. We help maximize the value of your real estate portfolio by guiding you through the transition to a more sustainable and environmentally friendly approach. With our comprehensive solutions, we monitor millions of square meters across multiple countries, detecting and achieving significant CO₂ emission savings. Our mission is to preserve the planet by making real estate more virtuous. Join our community of environmentally conscious real estate professionals and let us help you make a positive impact. Contact us for more information.

{
  "metadata": {
    "timestamp": "2023-12-05 15:47:49.292072",
    "source": "industryGPT"
  },
  "company_profile": {
    "website": null,
    "business_status": null,
    "industry": "Rea