# Installation

## 1. Optional: create virtual env

```
python -m venv venv
source venv/bin/activate
```

## 2. Install Repo

```
pip install .
```

Note, this does not include the finetuning notebook dependencies (pytorch, unsloth, etc.).

Note, if you want to re-run the data processing (i.e. the files in the `/scripts` folder), use an editable install instead (`pip install -e .`).

## 3. Add OpenAI API key

Add a `.env` file in the project root (or set the variable in your shell environment instead):

```
OPENAI_API_KEY=XXX
```
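
How the key is actually read depends on the code, which this README doesn't show. As a minimal sketch, assuming a hypothetical helper `load_api_key` and a plain `KEY=VALUE` file format, a lookup that prefers a variable set outside the project and falls back to `.env` could look like this:

```python
import os
from pathlib import Path


def load_api_key(env_path=".env", name="OPENAI_API_KEY"):
    """Return the key from the environment, falling back to a .env file.

    Hypothetical helper for illustration; the project may load the key differently.
    """
    # A variable set outside the project (e.g. in the shell) takes precedence.
    value = os.environ.get(name)
    if value:
        return value

    path = Path(env_path)
    if not path.is_file():
        return None

    # Parse simple KEY=VALUE lines, skipping blanks and comments.
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, val = line.partition("=")
        if key.strip() == name:
            return val.strip().strip('"').strip("'")
    return None
```

The function returns `None` when the key is set nowhere, so the caller can fail early with a clear message.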

## 4. Use the CLI

The qualifier CLI is automatically installed:
```
ugfind https://notion.so
```

## 5. Use the qualified list for outreach

You can find the CSV (with example copy) at `data/06_qualified.csv`.
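
To pull the list into an outreach script, something like the sketch below should be enough. `read_qualified` is a hypothetical helper, and no column names are assumed, since this README doesn't list them:

```python
import csv


def read_qualified(path="data/06_qualified.csv"):
    """Load the qualified list as a list of dicts, one per CSV row."""
    # newline="" lets the csv module handle line endings inside quoted fields.
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

Calling `read_qualified()` from the project root returns one dict per row, keyed by whatever header names the file has; `list(rows[0].keys())` shows them.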

# Notes

## 1. Signals for Usergems

Here are some UG signal ideas:
- From a list of signals, what would be the top 3 for a company, how well do they match/how relevant are they (the company description)?
    - Hi X, ... at UG we could give you the following 3 signals (that you can't get anywhere else):
        - S1
        - S2
        - S3
- Match closest customer
    - From a list of UG customers, who is the closest one? And then match up (and qualify if it's a good match)
    - Hi X, we work with {company name}, who, similar to you, is doing X. They use UG for Y. Want to check it out?
- Relying on jobs
    - are they hiring?
    - are they hiring AE/SDRs?
    - new sales initiatives?
    - are they advertising in their jobs any of the tools you integrate with?
- Do they have a product / service that needs a sales team (i.e. not self serve)
    - can't use this in copy, but gets rid of "university press", gets rid of not launched products, gets rid of self-serve products
     
     
From a business perspective, finetuning (a local model) is not worth it in this case.
- 60k companies -- not that much cost savings


## After manually reviewing first 20 sites (CS)

The input data is quite noisy. Out of the first 20 companies (computer software):
- 7 pages didn't even load
- 2 are not CS companies

After reviewing the website: another good signal is "are they selling a product/service to end customers"
- gets rid of sites that don't load
- gets rid of "Michigan University Press"

## 2. Finetuning

- Left too little time, and had no GPU... so had to set up server on AWS & SSH
- Also only trained for ~10min (single GPU) for demonstration; a real run would train longer

## 3. Improvements/tests I didn't have time for
- Use a DB instead of a sequence of csv files
    - even just SQLite would be good
- Optimize scraper
    - Use stealth mode / anti-anti-scraping tech
    - Investigate if there are major failures
    - Add retries
    - Add more elaborate JS waiting if needed
    - Record redirects for deduplication
- html2markdown
    - I know there is a benchmark of different repos, but couldn't find it
    - There are also a bunch of new repos I didn't see previously
    - Do some side-by-side comparisons for which markdown converter is the best (vs. vanilla html!)
    - Do more html stripping (i.e. what I do for images)
        - Good way to go about it is inspect the HTML length, markdown length, and the ratios
    - Sometimes text whitespace gets too squashed. Need to inspect where this happens
- Invalid websites
    - ~50% of invalid websites are actually valid, so this part can be made better.
        - some we couldn't scrape
        - some got mistakenly classified
    - it's ~10% of the list, so there is still juice here
- Job classification
    - "open apply" & "no jobs" frequently get confused in edge cases. Product-wise it's the least important distinction, but for a proper system, need some more time on this
- Job extraction
    - This really should be a separate step, but didn't have time so just hacked it onto the classification
- General GPT calls
    - Definitely need more evals
    - With a good eval set can do some prompt exploration
    - Should add production checks (i.e. if the job judgment is job list, the list should be non-empty)
- Finetuning
    - Super little time on this, so just doing very high-level
    - Used Unsloth starter notebook
    - To do properly:
        - Don't do this in a completely separate notebook; integrate it with the rest of the codebase
        - Check context sizes - might need to trim input or increase unsloth context
        - Check different models & settings
        - Train on much larger dataset
            - Also make sure the training dataset is actually clean
        - Train for longer (until validation metrics deteriorate)
        - Have separate valid & test set
        - Train whole pipeline / schema, not just classification
        - Set up instructor!
            - Right now we're just hoping LLAMA gets it right
        - And obviously proper deployment
- Whole flow:
    - Check what is happening with the loops; it's ridiculously high
- Add a web app for this!
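
The production check mentioned under "General GPT calls" (a "job list" judgment should come with a non-empty list) can be sketched roughly as below. The function name and the `"job_list"` label are hypothetical; the real labels depend on the classification schema:

```python
def check_job_result(judgment, jobs):
    """Return a list of consistency problems for one classification result.

    `judgment` and the label "job_list" are hypothetical names; adapt them to
    the labels the pipeline actually emits.
    """
    problems = []
    # A page judged to contain a job list must yield at least one job.
    if judgment == "job_list" and not jobs:
        problems.append("judgment is 'job_list' but no jobs were extracted")
    return problems
```

Returning a list instead of raising makes it easy to add more checks later and to log all problems for a row at once.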
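
For the "DB instead of a sequence of csv files" idea, a minimal SQLite sketch using only the standard library might look like this. The table layout (`url`, `stage`, `result`) and both function names are assumptions for illustration, not the project's schema:

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS companies (
            url    TEXT PRIMARY KEY,
            stage  TEXT NOT NULL,   -- last completed pipeline step
            result TEXT             -- step output, e.g. a JSON string
        )
        """
    )
    return conn


def save_step(conn, url, stage, result):
    """Insert or overwrite the row for `url` after a pipeline step finishes."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO companies (url, stage, result) VALUES (?, ?, ?)",
            (url, stage, result),
        )
```

With one row per URL, each step updates the same record instead of writing a new numbered CSV, and a query like `SELECT * FROM companies WHERE stage = 'scraped'` shows what is left to process.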
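
The HTML length vs. markdown length inspection from the html2markdown notes could start from something like the following. The helper names and the `low`/`high` thresholds are placeholder assumptions, not values taken from the project:

```python
def length_ratio(html, markdown):
    """Return len(markdown) / len(html), or 0.0 for empty HTML."""
    return len(markdown) / len(html) if html else 0.0


def flag_outliers(pages, low=0.01, high=0.9):
    """Return URLs whose markdown/HTML length ratio falls outside [low, high].

    `pages` maps url -> (html, markdown). The thresholds are placeholder
    assumptions and would need tuning on real data.
    """
    return [
        url
        for url, (html, md) in pages.items()
        if not low <= length_ratio(html, md) <= high
    ]
```

A very low ratio suggests the converter dropped most of the page; a very high one suggests little HTML was actually stripped. Either way those pages are the ones worth inspecting by hand.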