### Data enrichment with LLMs | V1 Evergrowth


<img src="/Users/ismadoukkali/Desktop/llm-data-enrichment/diagrams/evergrowth_diagrams.png" alt="Alt text" title="Image Title" width="1000"/>


#### Importing libraries

In [None]:
from llama_index import SimpleDirectoryReader
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen, Request
from dotenv import load_dotenv
import json
import pandas as pd
import requests
from llama_index import VectorStoreIndex
from llama_index.llms import OpenAI
from urllib.parse import urljoin, urlparse
from zenrows import ZenRowsClient
import os


#### OpenAI API functions

In [None]:
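load_dotenv()

# Use the official OpenAI SDK client here. The `OpenAI` name imported above from
# llama_index.llms is an LLM wrapper and has no `chat.completions` attribute.
from openai import OpenAI
client = OpenAI()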

def generate_response_gpt4(prompt):
    """Return a plain-text GPT-4 answer, capped at 300 tokens."""
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[
            {"role": "system", "content": "You are a helpful assistant, expert in Analysing Companies."},
            {"role": "user", "content": str(prompt)},
        ],
        max_tokens=300,
        temperature=0,
    )
    return response.choices[0].message.content

def generate_response_gpt4_json(prompt):
    """Return a GPT-4 answer as a JSON string (JSON mode, no token cap)."""
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": "You are a helpful assistant, expert in Analysing Companies."},
            {"role": "user", "content": str(prompt)},
        ],
        temperature=0,
    )
    return response.choices[0].message.content

def generate_response_gpt3_json(prompt):
    """Return a GPT-3.5 answer as a JSON string (JSON mode), capped at 300 tokens."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": "You are a helpful assistant, expert in Analysing Companies."},
            {"role": "user", "content": str(prompt)},
        ],
        max_tokens=300,
        temperature=0,
    )
    return response.choices[0].message.content

Prompt to identify relevant links within sitemap

In [None]:
prompt_identify_links = """

Decide whether this URL is relevant for determining whether the company operates its own physical stores and/or has partnerships with retailers. 

This includes details about the e-commerce platform, physical store locations, wholesale operations, and partnerships with external retailers.

Common URL Handles:

/about-us or /company-info: Often contains comprehensive information about the company's history, business model, and operational strategies.
/wholesale or /b2b: Pages specifically dedicated to wholesale operations and business-to-business sales.
/store-locator or /find-a-store: Helps in identifying physical store locations, indicating a brick-and-mortar sales channel.
/our-partners, /retailers or /dealers: Information about partnerships with external retailers.
/investor-relations or /press: Might contain detailed information about business operations for stakeholders.


If the link is relevant, output 'yes'; if not, output 'no'. Output the result in JSON format:

{
    "link": "input url",
    "relevant": "yes" or "no"
}

Here is the site: 

"""


Prompt to craft description for what the business does

In [None]:
organize_prompt = '''
From the following scraped text of a website, explain exactly what the business sells, its business model and the consumer channels the company uses to sell. 

The target consumer channels to look for are: 
1. Ecommerce
2. Owned Physical Stores
3. Wholesale / Retail Stores and/or Marketplaces

In bullet points, indicate whether this company operates in each of the consumer channels above.

This company description is going to be fed to a company categorization model where the combination of the channels for selling like own physical stores and/or traditional retail channels and/or ecommerce websites is a crucial component of the classification.

Search for keywords and indicators that inform the channels above, such as button names. If in doubt about a selling channel, be deterministic and always include it if it's mentioned. Don't second-guess mentions in the context.

Write everything in third person naming the company. Utilise terminology that is accurate for the specific industry and business model. 

Make this analysis short and concise.

'''

Prompt to classify into specific category

In [None]:
categories_prompt = """
                    
    Classify the company described below into one of the provided categories, taking into account what it sells, its business model, the terminology used to describe it and
    the sales channels the company uses to operate. 
    
    Stay within the categories mentioned in the context, never deviate. These categories are:
    
    "Pure-player D2C Brand": "A Pure-player D2C (Direct-to-Consumer) Brand specializes in selling products online directly to consumers, bypassing traditional retail or physical store channels. 
    Pure-player D2C Brands sell their goods directly to consumers via their website with no intermediaries, retailers or wholesalers."

    "Omnichannel D2C Brand": "An Omnichannel D2C Brand combines their own physical store presence with a robust online platform to sell products directly to their consumers.
    These brands sell exclusively through their own physical locations and their online channels, bypassing traditional retailers. They are known for their global reach, trendsetting products, and innovative retail strategies."

    "Omnichannel Retailer": "Omnichannel Retailers offer a unified traditional retail shopping experience through their network of physical stores and digital platforms selling products from other brands. 
    They prioritize customer convenience, integrating online and offline channels for a seamless transaction process."

    "Pure-player Retailer": "Pure-player Retailers operate exclusively through e-commerce platforms focusing on the resale of branded products that are not their own. 
    These retailers do not engage in product manufacturing, emphasizing online sales, logistics, and digital customer service."

    "Retail Branded Goods Manufacturer": "Retail Branded Goods Manufacturers design, produce, and sell their products, leveraging e-commerce, potentially their own physical stores and conventional retail channels/wholesalers."

    "FMCG & CPG": "FMCG (Fast-Moving Consumer Goods) & CPG (Consumer Packaged Goods) companies specialize in designing and manufacturing perishable goods which are sold exclusively through traditional retail channels. 
    All companies that focus on low-cost products, often used daily, like groceries, foods, cosmetics, drinks and household items, and that sell mainly through wholesale retailers, should be classified in this category. 
    They might have some activity in ecommerce but it may be minimal."

    "Online Marketplace": "Online Marketplaces provide a digital platform for third-party sellers and buyers to transact. They facilitate a wide range of goods, from consumer electronics to handmade crafts. These platforms focus on e-commerce efficiency, digital transactions, and customer-seller interaction without engaging in product manufacturing or branding. The Online Marketplace does not own the stock of goods sold on its platform, 
    it only provides the space for professional sellers and buyers to interact."

    "Classified Ads": "Companies in the Classified Ads sector facilitate the sale of goods through advertisement platforms. They provide a space for individual sellers and individual buyers to connect, often for second-hand or niche items, without directly engaging in the sales process themselves. Compared to online marketplaces, where the sellers tend to be established businesses, classified ad companies give a peer-to-peer space where stakeholders can sell their items.
    Here are some additional keywords related to this type of company: classified advertising, second-hand goods platform, buyer-seller advertisement, online classifieds, print classified ads, niche product sales, individual seller platform, advertisement-based sales, peer-to-peer marketplace, local goods advertising."

    "ICP no": "Companies which do not fall within the e-commerce and/or retail business model are classified as ICP no. These include companies with models such as SaaS, apps or others different from e-commerce or retail."
                                
    Here is the description of the company in question:


"""

#### Function for the Google Custom Search API

In [None]:
def google_search(query):
    API_KEY = os.getenv("GOOGLE_SEARCH_API")
    SEARCH_ENGINE_ID = '3232eee26d51543f1'

    url = 'https://www.googleapis.com/customsearch/v1'

    params = {
        'key': API_KEY,
        'cx': SEARCH_ENGINE_ID,
        'q': query
    }

    response = requests.get(url, params=params)

    if response.status_code == 200:
        results = response.json()
        if 'items' in results:
            return results['items'][0]['link']
        else:
            return 'No results found'
    else:
        return f'Error: {response.status_code}'

#### Helper functions

In [None]:
def truncate_string(input_string):
    # Slicing already returns the whole string when it is shorter than the limit.
    return input_string[:3000]

def format_url(url):
    if url.startswith("http://www."):
        url = url.replace("http://www.", "https://www.")
    elif url.startswith("http://"):
        url = url.replace("http://", "https://www.")
    elif url.startswith("www."):
        url = "https://" + url
    elif not url.startswith("https://"):
        url = "https://www." + url
    return url

def retrieve_html_old(url):
    headers = {'User-Agent': 'Mozilla/5.0'}
    req = Request(url, headers=headers)
    html = urlopen(req).read()
    soup = BeautifulSoup(html, features="html.parser")

    for script in soup(["script", "style"]):
        script.extract() 

    text = soup.get_text()
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return text
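
`format_url` above prepends `www.` to any input that does not already start with `https://`, so a bare subdomain host such as `us.air-up.com` would become `https://www.us.air-up.com`, which may not resolve. As a hedged alternative, here is a minimal sketch that only forces the `https` scheme and leaves the host untouched. `normalize_url` is a hypothetical helper and is not used elsewhere in this notebook.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    """Force an https scheme without changing the host (hypothetical helper)."""
    url = url.strip()
    # Inputs without a scheme would otherwise be parsed as a bare path.
    if "://" not in url:
        url = "https://" + url
    parts = urlsplit(url)
    # Drop the fragment, keep the query, and default an empty path to "/".
    return urlunsplit(("https", parts.netloc, parts.path or "/", parts.query, ""))

print(normalize_url("http://www.epicbar.com"))
print(normalize_url("us.air-up.com"))
```

Whether to add `www.` at all depends on the sites being enriched, so treat this as one possible design rather than a replacement.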

#### ZenRows API scraping function

In [None]:
def retrieve_html(url):
    client = ZenRowsClient(os.getenv("ZEN_ROWS_SCRAPER"))

    params = {"block_resources": "image,media,font",
              "premium_proxy":"true"}

    response = client.get(url, params=params)

    soup = BeautifulSoup(response.text, "html.parser")

    # Process the HTML as before
    for script in soup(["script", "style"]):
        script.extract()

    text = soup.get_text()
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = '\n'.join(chunk for chunk in chunks if chunk)

    return text

#### Function to extract all same-domain links from a page

In [None]:
def extract_all_links(base_url):
    client = ZenRowsClient(os.getenv("ZEN_ROWS_SCRAPER"))

    params = {"block_resources": "image,media,font",
              "premium_proxy": "true"}

    try:
        # Keep the request inside the try block so network errors are handled too.
        response = client.get(base_url, params=params)
        soup = BeautifulSoup(response.text, 'html.parser')
        links = {base_url}
        for tag in soup.find_all('a', href=True):
            full_url = urljoin(base_url, tag['href'])
            # Keep only links that stay on the same host as the base URL.
            if urlparse(full_url).netloc == urlparse(base_url).netloc:
                links.add(full_url)
        return links
    except Exception as e:
        print(f"Failed to retrieve the main webpage: {e}")
        # Return an empty set so the return type matches the success path.
        return set()

# Example usage
#url = "https://wholeearthsweetener.com/"  # Replace with the desired URL
#links = extract_all_links(url)
#print(links)
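
The same-domain filter in `extract_all_links` can be tried offline, without a ZenRows key. The sketch below applies the same `urljoin` / `urlparse(...).netloc` logic to a hard-coded HTML snippet, using the standard library `html.parser` in place of BeautifulSoup. `LinkCollector`, `same_domain_links`, and the `example.com` URLs are illustrative assumptions, not part of the pipeline.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

def same_domain_links(base_url, html):
    """Return base_url plus every link in html that stays on the same host."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    links = {base_url}
    for href in collector.hrefs:
        full_url = urljoin(base_url, href)  # resolves relative hrefs
        if urlparse(full_url).netloc == base_host:
            links.add(full_url)
    return links

sample_html = """
<a href="/store-locator">Stores</a>
<a href="https://example.com/wholesale">Wholesale</a>
<a href="https://other.example.org/">Partner</a>
<a>No href</a>
"""
links = same_domain_links("https://example.com/", sample_html)
print(sorted(links))
```

Note that the comparison is on the exact host, so `www.example.com` and `example.com` would be treated as different sites.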

In [None]:
from openai import OpenAI

load_dotenv()
client = OpenAI()

executor = ThreadPoolExecutor(100)

def extract_relevant_links(base_link, links):
    # Submit one relevance check per link and remember which link each future belongs to.
    futures = [
        (link, executor.submit(generate_response_gpt3_json, prompt_identify_links + link))
        for link in links
    ]

    relevant_links = set()
    for link, future in futures:
        try:
            answer = json.loads(future.result()).get("relevant", "")
        except Exception as e:
            # A failed call or malformed JSON should not abort the whole batch.
            print(f"Skipping {link}: {e}")
            continue
        # Store the original link rather than the URL echoed back by the model.
        if str(answer).lower() == "yes":
            relevant_links.add(link)

    relevant_links.add(base_link)
    print('Went from all links in site: ', len(links))
    print('To this size: ', len(relevant_links))
    print('Relevant links: ', relevant_links)

    return relevant_links
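
The link filter depends on the model returning JSON with a `"relevant"` key set to `"yes"` or `"no"`. Below is a minimal sketch of a defensive parser for that reply, assuming the reply shape requested in `prompt_identify_links`. `parse_relevance` is a hypothetical helper name; anything that is not a clear `"yes"` is treated as not relevant.

```python
import json

def parse_relevance(raw_reply):
    """Return True only for valid JSON objects whose "relevant" value is "yes"."""
    try:
        data = json.loads(raw_reply)
    except (TypeError, json.JSONDecodeError):
        # None, non-string input, or malformed JSON all count as "not relevant".
        return False
    if not isinstance(data, dict):
        return False
    return str(data.get("relevant", "")).strip().lower() == "yes"

print(parse_relevance('{"link": "https://example.com/store-locator", "relevant": "yes"}'))
print(parse_relevance("not json"))
```

Defaulting to "not relevant" keeps the scraped context small when the model misbehaves; the base link is still added separately by `extract_relevant_links`.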

#### Function to build context file

In [None]:
def scrape_sites(urls):
    all_text = ""
    for url in urls:
        try:
            text = retrieve_html(url)
            all_text += text + "\n\n" + "------------------------------------------------" + "\n\n" 
        except Exception as e:
            print(f"Error retrieving {url}: {e}")

    # Opening in 'w' mode already truncates the file, so one open call is enough.
    os.makedirs('data', exist_ok=True)
    with open('data/site_data.txt', 'w', encoding='utf-8') as file:
        file.write(all_text)
    
    print('Company context successfully updated :)')

#### Main loop function with timeout

In [None]:
import signal
import time

class TimeoutException(Exception):
    pass

def timeout_handler(signum, frame):
    raise TimeoutException

def run_function_with_timeout(industryGPT, company):
    # SIGALRM is only available on Unix and only works in the main thread.
    signal.signal(signal.SIGALRM, timeout_handler)
    signal.alarm(200)

    try:
        start_time = time.time()

        result = industryGPT(company)

        elapsed_time = time.time() - start_time
        print(f"industryGPT: Completed in {elapsed_time:.2f} seconds.")

        return result
    except TimeoutException:
        print('Result took too long to output.')
        return None
    except Exception as e:
        print(f"An error occurred: {e}")
        return None
    finally:
        # Always cancel the pending alarm, including when industryGPT raises.
        signal.alarm(0)
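
`signal.SIGALRM` is not available on Windows, and signal handlers can only be installed from the main thread. Where that is a problem, a thread-based timeout is one possible alternative. The sketch below is an assumption-laden variant and not the notebook's method: `run_with_thread_timeout` is a hypothetical name, and unlike the signal approach it cannot interrupt the work, so a timed-out call keeps running in the background until it finishes.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def run_with_thread_timeout(func, arg, timeout_seconds):
    """Return func(arg), or None if it raises or exceeds timeout_seconds."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func, arg)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        print("Result took too long to output.")
        return None
    except Exception as e:
        print(f"An error occurred: {e}")
        return None
    finally:
        # Do not block on a worker that may still be running.
        pool.shutdown(wait=False)

def slow(_):
    time.sleep(0.5)
    return "done"

print(run_with_thread_timeout(lambda x: x * 2, 21, 1.0))
print(run_with_thread_timeout(slow, None, 0.05))
```

Because the worker thread is not cancelled, this variant suits calls that eventually return on their own, such as HTTP requests that have their own timeouts.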

In [None]:
from openai import OpenAI

load_dotenv()
client = OpenAI()

def industryGPT(company):
    try:
        all_links = extract_all_links(company)
        relevant_links = extract_relevant_links(company, all_links)
        scrape_sites(relevant_links)
            
    except Exception as e:
        print('Error when looking at site, ', company, ':', e)
        return None

    from llama_index.llms import OpenAI
    from llama_index import ServiceContext
    
    documents = SimpleDirectoryReader("data").load_data()
    llm = OpenAI(model="gpt-4-1106-preview", temperature=0, max_tokens=300)
    service_context = ServiceContext.from_defaults(llm=llm)

    index = VectorStoreIndex.from_documents(
        documents, service_context=service_context
    )

    query_engine = index.as_query_engine(
        response_mode="refine", streaming=True, similarity_top_k=1
    )
    

    description = query_engine.query(organize_prompt)
    
    print()
    # description.print_response_stream()

    description_no_streaming = ""

    # Print the streamed tokens as they arrive and accumulate the full text.
    for text in description.response_gen:
        print(text, end='')
        description_no_streaming += text

    # print(description_no_streaming)
    

    classification = generate_response_gpt4_json(categories_prompt + description_no_streaming + """

                            Output your response in JSON format like this: 
                            
                            {
                                "name": "company name",
                                "category": "identified category"
                            }

                            Never deviate from the categories above, always stay within the taxonomy. 
                            If the category cannot be identified, just output "ICP no" there.

                            """)
    
    print()
    print()
    response_dict = json.loads(classification)
    print("Company name: ", response_dict["name"])
    print("Category: ", response_dict["category"])
    print("JSON: ", response_dict)
    response_dict["description"] = description_no_streaming
    print()
    print('--------------------')
    print()


    return response_dict
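
The classification prompt asks the model never to leave the taxonomy, but `industryGPT` does not check the returned value. A minimal sketch of such a check follows, assuming the nine category names listed in `categories_prompt` and the `"ICP no"` fallback the prompt itself specifies. `parse_classification` and the two constants are hypothetical names.

```python
import json

# Category names copied from categories_prompt.
ALLOWED_CATEGORIES = {
    "Pure-player D2C Brand",
    "Omnichannel D2C Brand",
    "Omnichannel Retailer",
    "Pure-player Retailer",
    "Retail Branded Goods Manufacturer",
    "FMCG & CPG",
    "Online Marketplace",
    "Classified Ads",
    "ICP no",
}
FALLBACK_CATEGORY = "ICP no"

def parse_classification(raw_reply):
    """Parse the model reply and force the category into the taxonomy."""
    try:
        data = json.loads(raw_reply)
    except (TypeError, json.JSONDecodeError):
        data = {}
    if not isinstance(data, dict):
        data = {}
    category = data.get("category")
    # Anything outside the taxonomy (or not a string) falls back to "ICP no".
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        category = FALLBACK_CATEGORY
    return {"name": data.get("name"), "category": category}

print(parse_classification('{"name": "Acme", "category": "Online Marketplace"}'))
print(parse_classification('{"name": "Acme", "category": "Something else"}'))
```

The match is exact and case-sensitive, so a reply such as `"online marketplace"` would also fall back; a looser comparison is possible if that proves too strict.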

### One enrichment demo

Run from here to get test enrichment. 

In [None]:
response = run_function_with_timeout(industryGPT, "https://wholeearthsweetener.com/")

### Multiple enrichment demos
Run from here for a batch enrichment.

In [None]:
df = pd.DataFrame(columns=["Name", "Website", "Category", "Description"])

In [None]:
array_of_companies = ["https://wholeearthsweetener.com/", "https://www.charlottesweb.com/", "https://www.factor75.com/", 
                      "https://candyhackers.com/", "https://us.air-up.com/", "https://www.arrae.com/", "https://www.greenchef.com/", 
                      "https://thevitacococompany.com/", "https://www.gumtreegolfandnature.com/", "http://www.epicbar.com", 
                      "https://blendjet.com/", "https://chomps.com/", "https://effingoodsnacks.com/", "https://joinfightcamp.com/", 
                      "https://jackweirandsons.com/", "http://www.thepurplecarrot.com", "https://www.maximustribe.com/", "https://www.fuelmeals.com/", 
                      "https://gfuel.com/", "https://www.freshnlean.com/", "https://havenskitchen.com/", "https://www.hungryroot.com/", "https://www.justmeats.com/", 
                      "https://www.gumtreegolfandnature.com/", "https://www.cookunity.com/", "https://jain.golf/", "https://www.oatsovernight.com/", 
                      "https://www.nutpods.com/", "https://www.myollie.com/", "https://opalcamera.com/", "https://poponveneers.com/", "https://publicdrip.com/", 
                      "https://www.readyrefresh.com/", "https://softframedesigns.com/", "https://mymetabolicmeals.com/", "https://sprinly.com/", "https://smartfinancial.com", 
                      "https://www.suvie.com/", "https://www.tempomeals.com/", "https://aliceandolivia.com", "https://www.trifectanutrition.com/", "https://fender.com"]

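enrichment = []
for company in array_of_companies:
    # Reset every field so a failed run never reuses the previous company's values.
    name = None
    category = None
    description = None

    print()
    try:
        print('Retrieving: ', format_url(str(company)))
        response = run_function_with_timeout(industryGPT, company)
        # run_function_with_timeout returns None on timeout or error.
        if response is not None:
            name = response["name"]
            category = response["category"]
            description = response["description"]
    except Exception:
        print("Error while enriching site: ", company)
        print()

    print('--------------------------------')

    enrichment.append({"Name": name,
                       "Website": company,
                       "Category": category,
                       "Description": description})

# Build the DataFrame once from the collected rows.
df = pd.DataFrame(enrichment, columns=["Name", "Website", "Category", "Description"])
df.to_csv("Evergrowth_Enriched.csv")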
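
The company list above contains `https://www.gumtreegolfandnature.com/` twice, so that site is scraped and classified twice. If that is not intended, one simple option is to drop exact duplicates before the loop while keeping the original order. `dedupe_preserving_order` is a hypothetical helper; it compares strings exactly and does not normalize URLs.

```python
def dedupe_preserving_order(urls):
    """Remove exact duplicate strings while keeping first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(urls))

sample = [
    "https://www.gumtreegolfandnature.com/",
    "https://www.cookunity.com/",
    "https://www.gumtreegolfandnature.com/",
]
unique_sample = dedupe_preserving_order(sample)
print(unique_sample)
```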