# M7003_finalpiece_24000114067

LIS MASc  
The Right Word  
Final Piece (Choice 2: NLP)  
Student number: 24000114067

Access notebook, data, pdf and web app in the [GitHub repository](https://github.com/noah-art3mis/nlp-assignment).

PDF generated using [nbconvert](https://nbconvert.readthedocs.io/en/latest/)

## Intersect - Personalized job matching

_Intersect_ is a web app (access [here](https://intersect.streamlit.app/)) that can be used to find jobs that match your profile. The purpose of this tool is to help find interesting jobs more quickly by using NLP instead of manually going through hundreds of results. It might also surface unexpected results, because it is not restricted to the usual job-related keyword search.

## Instructions

-   insights
    -   pics
-   corpus
    -   Data was scraped from cv-library around a month ago using keywords provided by classmates, using the location of London and default settings. This data source was chosen because it is the first one cited by the UK Government. It also complies with robots.txt. This was most of the work.
-   tools
-   500w
-   pdf w link to github w notebook and data
-   justify choices in analysis

-   The NLP-aided analysis of a corpus of documents will extract non-obvious insights from a corpus of documents using tools from coding and data science. The corpus of documents could consist of a literary archive, a body of social media posts, a scrape from an online knowledge database; ideally, these would have some relation to the capstone problem. The submission should offer a graphical representation of its discoveries and include a write up of c. 500 words that documents the process involved. A link to an executable code notebook should also be included as part of the submission.

-   Assessment Choice 2 should be submitted as a PDF file that contains a brief summary of the submission and which contains a link to a GitHub repository where the code notebook and all relevant data can be found - max 500 words

-   The submission should use technical concepts from linguistics and/or NLP
    in practically useful way, drawing on relevant sources and tools.
-   The submission should justify the intellectual choices it makes with respect
    to analysis and/or optimisation in relation to theoretical methods and
    established methodologies—and do so in a critically informed way.
-   Where relevant, the submission should convincingly manipulate language
    and/or narrative qualitatively and/or quantitatively, using appropriate
    methods.
-   The assessment should evidence skills in the communication of complex
    ideas in its chosen format.

## Results

Hopefully this project can be actually useful to me and others in similar positions. The README contains a long list of possible improvements.

It is unclear what ordering the website uses, but it is different from all of our attempts.

These results were collected manually. For a more serious attempt one would need to evaluate the results with automated tests using standard benchmarks.

---

This work contains #TODO words (not counting tables and figures).

---


## Corpus

## Tools

The big idea here is the difference between _lexical search_ and _semantic search_. BM25, since it is based on TF-IDF, uses sparse vectors and matches on the surface form of the words. It considers the number of times a word appears in a document and the number of documents it appears in, and down-weights words that appear in too many documents. Semantic search using embeddings uses dense vectors and is based on the meaning of the words (ref).
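To make the lexical side concrete, here is a minimal sketch of BM25 scoring in plain Python. It is not the implementation used in the app: the whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, and the job snippets are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplified tokenizer for illustration: lowercase and split on whitespace.
    return text.lower().split()


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with the Okapi BM25 formula."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for doc in tokenized:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            # Rare terms get a high IDF; terms found in most documents get a low one.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # k1 makes term frequency saturate; b normalizes by document length.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical job snippets, invented for this example.
jobs = [
    "data analyst role in london working with python",
    "sales manager role in london",
    "python developer role building data pipelines in python",
]
scores = bm25_scores("python data", jobs)
ranking = sorted(range(len(jobs)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

A document that shares no token with the query scores exactly zero here, which is the main practical difference from dense embeddings, where every document gets some similarity to every query.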

The process of extraction, transformation and loading of this is as follows:

-   Do the search in the website using provided keyword and location.
-   Vectorize the results.
-   Acquire user text.
-   Vectorize the text.
-   Reorder the results of the website search using either semantic similarity or displacement.
-   Plot a reduced dimension space of the vectorized results.

    -   PCA and K-means are used because of their simplicity. The projection uses 2 dimensions for ease of use, and several numbers of clusters are precomputed so the user can change them on the fly. It might be better to use t-SNE or LSA. Since the number of clusters is unknown, mean shift or DBSCAN, which do not need it to be set in advance, would be more appropriate.

-   Independently from this, we also run

    -   BM25
        -   In this case we lowercase, drop stop words, use a stemmer and tokenize. We use a stemmer instead of a lemmatizer because a lemmatizer would be slower and this is supposed to be a real-time application. We could also remove numbers and special characters, but this would add unneeded complexity.
    -   reranker (cross-encoding).
        -   Semantic search can be fast enough for millions of results, but it is general: each document is embedded once, independently of the query. A cross-encoder instead passes the query and each candidate document through the model together and outputs a relevance score for the pair. Nothing can be precomputed, so every query costs one model pass per candidate, which means it gets significantly slower the more items you use. It is usually used after other methods to 'rerank' a subset (10 or 100 items) of the bigger result.
    -   Bag-of-words word cloud (using TF-IDF vectorizer for convenience).

-   We also tried

    -   named entity recognition
        -   ...
    -   permutation using llms
        ...

    However, in the first case the process was too slow and the results were not useful, and in the second case the model did not follow the prompt. As we have enough already, these were dropped.

-   The idea is to intersect two searches: the original one with keywords and the semantic one with text. This lets you search for both what you want and what you need at the same time. You can use the keyword as a want (`fun`) or as a need (`data`), and vice versa.

-   As far as tool choice goes:
    -   OpenAI, Streamlit and Cohere were chosen because of their simplicity; I find that they are the fastest ways to make a prototype. They can be switched to a free, local and open-source alternative if needed.
        -   Data is stored in .feather files because they are small enough not to need a sql database or vector database.
        -   We can use the dot product as a similarity measure because OpenAI embeddings are normalized to unit length, in which case it equals the cosine similarity.
-   _CV-Library_ is used as the data source as it is cited by the [UK Government](https://nationalcareers.service.gov.uk/careers-advice/advertised-job-vacancies).
-   BM25 is an industry-standard modified version of TF-IDF.
    -   https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables
    -   https://docs.llamaindex.ai/en/stable/examples/retrievers/bm25_retriever/
    -   Note that the similarity values do not mean much in absolute terms. See
        -   https://datascience.stackexchange.com/questions/101862/cosine-similarity-between-sentence-embeddings-is-always-positive
        -   https://vaibhavgarg1982.medium.com/why-are-cosine-similarities-of-text-embeddings-almost-always-positive-6bd31eaee4d5
    -   for a production-grade implementation of this we might want to take a look at https://github.com/AmenRa/retriv
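The point above about using the dot product on normalized embeddings can be checked with a small sketch. The 3-dimensional vectors and the job names are made up for illustration; real embeddings have many more dimensions and would come from the embedding API.

```python
import math


def normalize(vector):
    # Scale a vector to unit length (L2 norm of 1).
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors standing in for real embeddings (made-up values).
query = normalize([0.2, 0.9, 0.1])
jobs = {
    "job_a": normalize([0.1, 0.8, 0.3]),
    "job_b": normalize([0.9, 0.1, 0.2]),
    "job_c": normalize([0.3, 0.7, 0.0]),
}

# For unit-length vectors the cosine denominator is 1, so both measures agree.
for vector in jobs.values():
    assert math.isclose(dot(query, vector), cosine(query, vector))

# Only the ordering is used; the absolute values are not interpreted.
ranked = sorted(jobs, key=lambda name: dot(query, jobs[name]), reverse=True)
print(ranked)
```

Because only the ordering is used to reorder results, it matters little that the raw similarity values are hard to interpret on their own.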

This tool seems to have some potential, but the way to use it is still a bit unclear.

As future directions, we could add the LLM permutation, add a visa sponsor boolean column using UKVI data, add a way to download the results as a CSV and, finally, fix the ETL pipeline so that the scraping works in real time.


## Insights

I must give the caveat that since the purpose of this tool is to find more interesting job offers for people searching for jobs, the assessment of its utility is fundamentally subjective. If we were to try to measure this in a more objective way, we would need to develop a benchmark database, which would be out of scope for this project. As such, we take the recourse of manual testing and reporting interesting findings. If one of the results of the search is useful but the rest is noise, we can already consider this successful.

We are comparing three different search methods: the original (unknown) one, our lexical search and our semantic search. For example:

...

## Semantic search

-   either very different from the original one

## Lexical search

-   BM25 looks similar to the semantic displacement and not that much to the semantic search

### Dimensionality reduction and clustering

The expectation here would be to look at the points closest to your text and see if they are located in a particular cluster.
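One way to make this check less visual is to count the cluster labels of the nearest neighbours of the user's text in the reduced space. The sketch below assumes 2D coordinates and cluster labels are already available; the points, labels and the helper name `nearest_cluster_counts` are hypothetical and not taken from the notebook.

```python
import math
from collections import Counter


def nearest_cluster_counts(query_point, points, labels, k=3):
    """Count the cluster labels of the k points closest to query_point."""
    order = sorted(range(len(points)), key=lambda i: math.dist(query_point, points[i]))
    return Counter(labels[i] for i in order[:k])


# Made-up 2D coordinates and labels, standing in for the PCA and K-means output.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
labels = [0, 0, 0, 1, 1, 1]
user_text_point = (0.1, 0.1)

counts = nearest_cluster_counts(user_text_point, points, labels, k=3)
dominant_cluster, hits = counts.most_common(1)[0]
print(dominant_cluster, hits)
```

If most of the `k` neighbours share one label, the text sits inside a cluster; an even split suggests it lies between clusters or that the clustering is not informative for that search.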

Most of the different databases show just a cloud of points that does not suggest anything in particular.

...

However, in two cases, there seem to be clear clusters that are not obvious from the keyword search: `fun` and `leadership`.

...

...
