# Task Description

As a sales intelligence platform, we collect all kinds of information about companies. The purpose is to make a more informed decision about which companies we or our clients should target (try to win as a customer) and which not. At the moment, for example, we try to find out a company's industry, its business model, and who its customers are (businesses, private individuals, etc.). There are many possible sources for this information; however, we only use information that is openly available on the internet. This could be the company's LinkedIn page, news and blog posts of any kind, or the company webpage. The webpage is a particularly good source of information, since:

- There are no limitations on what a company can present. No size limitation, for example.
- Usually there is no anti-scraping protection: companies do not mind their webpages being scraped, since they want information about themselves to spread as widely as possible.

On the other hand, it’s not so easy to extract any specific information from the company webpage, since there is no predefined structure - each website has its own, sometimes unique, design and structure.

We provide you with a dataset that is a list of company websites.

https://drive.google.com/file/d/1eCOboCXzUdEXeDgOj3rdcBvac5Mq1f7f/view?usp=sharing

The challenge consists of several tasks.

1. Come up with an idea of what kind of useful information we could extract from the website. This information should help UserGems decide whether the company in question could be a good target for selling the UG platform. We usually call such a useful piece of information a signal. In this first task we don’t evaluate your skills as a Data Scientist, but rather your ability to think from a business perspective.

2. Find a way to scrape and extract the suggested signal from the webpage. You can use any kind of scraping service to get the contents of the webpage and also use LLMs for parsing the data. We can give you limited access to the OpenAI API if necessary. This access will have a limit on how many tokens you can use. If you run out of tokens, let us know and we might increase your limit.

3. Using the dataset you’ve created with the LLM, train a local model to extract the signal from the webpage. The point of training our own model is to replace calls to the OpenAI API for signal extraction, which are slow and costly. We do not expect this model to be very accurate, but you need to provide an evaluation of how good your model is. It’s up to you which model to use. It can, for example, be a small transformer model from Hugging Face. If you need GPUs for training the model, you can use the freely available Google Colab notebooks with GPU support.


## Important Points

❗Be aware that this is an open-ended challenge. Don’t get lost in the multitude of nuances, but rather focus on some specific points/signals that potentially could bring value to our product. We value a simple solution that works rather than excessive theoretical considerations.

❗Consider that we don’t expect you to scrape all webpages in the dataset, but rather to come up with a working solution that could easily be extended to large datasets. You might use additional columns in the dataset to filter out the companies that could be more relevant to the signal you come up with.

## Time & Compensation

We expect you to spend around one full working day on the problem. 

Your effort will be compensated with 350 Euros. 

During this challenge you might want to use OpenAI.

For that you can use the following API key: https://share.1password.com/s#t8bY_OHbkfMEa6f25au1wBDyWAgKqty_YDAp5OSrEqo
If you have any problems with the access or any other questions, you can reach out to us. Please use the "reply to all" option; my colleagues Andrey and Jalob are in cc.

Best Regards,

Elena


# 1. Signals for UserGems

Here are some UG signal ideas:
- From a list of signals, which would be the top 3 for a company, and how well do they match (how relevant are they to the company description)?
    - Hi X, ... at UG we could give you the following 3 signals (that you can't get anywhere else):
        - S1
        - S2
        - S3
- Match closest customer
    - From a list of UG customers, who is the closest one? And then match up (and qualify if it's a good match)
    - Hi X, we work with {company name}, who, similar to you, is doing X. They use UG for Y. Want to check it out?
- Relying on jobs
    - are they hiring?
    - are they hiring AEs/SDRs?
    - new sales initiatives?
    - are they advertising in their jobs any of the tools you integrate with?
- Do they have a product / service that needs a sales team (i.e. not self-serve)?
    - can't use this in copy, but it filters out "university press", products that have not launched yet, and self-serve products
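
The job-based signals above could be approximated with plain keyword matching before reaching for an LLM. The following is a minimal sketch, assuming job postings have already been scraped into dictionaries with `title` and `description` keys. The function name `job_signals`, the role patterns, and the tool names are hypothetical placeholders, not part of the actual pipeline.

```python
import re

# Hypothetical patterns for sales roles; tune these against real postings.
ROLE_NAMES = re.compile(
    r"\b(account executive|sales development rep(resentative)?)\b",
    re.IGNORECASE,
)
# Abbreviations are matched case-sensitively to avoid false positives.
ROLE_ABBREVS = re.compile(r"\b(AE|SDR)s?\b")


def job_signals(jobs, integrated_tools):
    """Derive simple hiring signals from a list of scraped job postings.

    jobs: list of dicts with a "title" and an optional "description" (assumed shape).
    integrated_tools: tool names to look for in the job descriptions.
    """
    sales_titles = [
        job["title"]
        for job in jobs
        if ROLE_NAMES.search(job["title"]) or ROLE_ABBREVS.search(job["title"])
    ]
    all_text = " ".join(job.get("description", "") for job in jobs).lower()
    tools_mentioned = sorted(
        tool for tool in integrated_tools if tool.lower() in all_text
    )
    return {
        "is_hiring": bool(jobs),
        "is_hiring_sales": bool(sales_titles),
        "sales_titles": sales_titles,
        "tools_mentioned": tools_mentioned,
    }


# Example usage with made-up postings and placeholder tool names.
example_jobs = [
    {"title": "Senior Account Executive", "description": "Experience with Salesforce required."},
    {"title": "Backend Engineer", "description": "Python and PostgreSQL."},
]
signals = job_signals(example_jobs, ["Salesforce", "HubSpot"])
```

Keyword matching will miss unusual titles, so it is probably best used as a cheap pre-filter or as a baseline to compare an LLM-based extractor against.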
     
     
From a business perspective, fine-tuning a local model is not worth it in this case.
- With only 60k companies, the cost savings are small
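
A back-of-the-envelope calculation makes the cost argument easier to check. The sketch below is parametric: only the 60k company count comes from these notes, while the tokens per company and the price per million tokens are placeholder assumptions, not measured values or actual OpenAI pricing.

```python
def llm_extraction_cost(n_companies, tokens_per_company, usd_per_million_tokens):
    """Total API cost in USD for one extraction pass over all companies."""
    total_tokens = n_companies * tokens_per_company
    return total_tokens / 1_000_000 * usd_per_million_tokens


# 60k companies is from the notes; the other two values are placeholders.
cost = llm_extraction_cost(
    n_companies=60_000,
    tokens_per_company=4_000,      # assumed average prompt + completion size
    usd_per_million_tokens=0.50,   # assumed blended price, replace with the real one
)
print(f"Estimated one-off API cost: ${cost:,.2f}")
```

Plugging in real token counts and prices gives the number to compare against the engineering time needed to train and maintain a local model.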


# 2. After manually reviewing the first 20 sites (CS)

The input data is quite noisy. Out of the first 20 companies (computer software):
- 7 pages didn't even load
- 2 are not CS companies

After reviewing the websites, another good signal is "are they selling a product/service to end customers":
- gets rid of pages that do not load
- gets rid of "Michigan University Press"

# 3. Improvements/tests I didn't have time for
- Use a DB instead of a sequence of CSV files
    - even just SQLite would be good
- Optimize scraper
    - Use stealth mode / anti-anti-scraping tech
    - Investigate if there are major failures
    - Add retries
    - Add more elaborate JS waiting if needed
    - Record redirects for deduplication
- html2markdown
    - I know there is a benchmark of different repos, but couldn't find it
    - There are also a bunch of new repos I didn't see previously
    - Do some side-by-side comparisons to find the best markdown converter (vs. vanilla HTML!)
    - Do more HTML stripping (like what I do for images)
        - A good way to go about it is to inspect the HTML length, markdown length, and their ratios
    - Sometimes text whitespace gets too squashed. Need to inspect where this happens
- Invalid websites
    - ~50% of invalid websites are actually valid, so this part can be made better.
        - some we couldn't scrape
        - some got mistakenly classified
    - but it's only 10% of the list, 
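
Two of the items above, replacing the CSV files with SQLite and inspecting HTML/markdown length ratios, fit together naturally. The following is a minimal sketch using Python's built-in `sqlite3` module. The table layout, column names, status values, and the `min_ratio` threshold are all assumptions for illustration, not the schema used in this project.

```python
import sqlite3


def init_db(path=":memory:"):
    """Open (or create) the database and make sure the pages table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS pages (
            url          TEXT PRIMARY KEY,  -- URL from the input dataset
            final_url    TEXT,              -- URL after redirects, for deduplication
            status       TEXT NOT NULL,     -- e.g. 'ok' or 'failed' (assumed values)
            html_len     INTEGER,
            markdown_len INTEGER
        )
        """
    )
    return conn


def save_page(conn, url, final_url, status, html, markdown):
    """Insert or overwrite one scrape result, storing only the text lengths."""
    conn.execute(
        "INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?, ?)",
        (url, final_url, status, len(html or ""), len(markdown or "")),
    )
    conn.commit()


def suspicious_ratios(conn, min_ratio=0.01):
    """Return pages whose markdown is suspiciously short relative to the HTML."""
    cursor = conn.execute(
        """
        SELECT url, html_len, markdown_len,
               1.0 * markdown_len / html_len AS ratio
        FROM pages
        WHERE status = 'ok'
          AND html_len > 0
          AND 1.0 * markdown_len / html_len < ?
        ORDER BY ratio
        """,
        (min_ratio,),
    )
    return cursor.fetchall()


# Example usage with made-up pages.
conn = init_db()
save_page(conn, "http://a.example", "http://a.example/", "ok", "x" * 1000, "y" * 5)
save_page(conn, "http://b.example", "http://b.example/", "ok", "x" * 1000, "y" * 300)
save_page(conn, "http://c.example", None, "failed", None, None)
rows = suspicious_ratios(conn)
```

Pages returned by `suspicious_ratios` are candidates for manual inspection: a very low ratio may point to over-aggressive stripping or to content that is rendered by JavaScript. Keeping `final_url` in the same table also allows deduplicating companies that redirect to the same site.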