# ZHAW MatchMaking App - Guide from Scratch
- this guide summarizes all steps taken to create the databases and host the app on `Replit` (or any other platform)
- the steps are written down for reproducibility; they are not optimized and include some redundancies

## Prerequisites
You need the `combined.json` file, which contains a list of dictionaries with all user data in the following format:
  ```
  {"name":"User Name","shorthandSymbol":"xxxx","profileURL":"https://www.zhaw.ch/de/ueber-uns/person/xxxx","position":"Dozent/in","institution":"ZHAW xxxx","location":"xxxx","phone":"xxxx","email":"xxxx@zhaw.ch","imageURL":"https://intra.zhaw.ch/forschungsdaten/portraet/images/xxxx.jpg"}
  ```
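Before continuing, it can help to confirm that your file really matches this format. The following is a minimal sketch using only the standard library; the field names are taken from the example record above, while the helper names `find_invalid_records` and `check_file`, the default path, and the assumption that every field is mandatory are illustrative and not part of the repository.

```python
import json
from pathlib import Path

# Field names copied from the example record above.
# Treating all of them as mandatory is an assumption of this sketch.
REQUIRED_FIELDS = {
    "name", "shorthandSymbol", "profileURL", "position", "institution",
    "location", "phone", "email", "imageURL",
}


def find_invalid_records(records):
    """Return (index, missing_fields) for every record that lacks a field."""
    problems = []
    for i, record in enumerate(records):
        if not isinstance(record, dict):
            # Not a dictionary at all, so every field counts as missing.
            problems.append((i, sorted(REQUIRED_FIELDS)))
            continue
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            problems.append((i, sorted(missing)))
    return problems


def check_file(path="ingest/combined.json"):
    """Load the JSON file and report records that do not match the format."""
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(records, list):
        raise ValueError("expected a JSON list of user records")
    return find_invalid_records(records)
```

Calling `check_file()` from the repository root returns an empty list if all records carry the expected keys.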

- Clone the [repository](https://github.com/przvlprd/zhaw-matchmaking-app), create a virtual environment, activate it and install the requirements as described in the *Local Setup / Prerequisites* section in the `README.md`.
- Place the `combined.json` inside the `ingest` folder.
- Place your **OpenAI API key** in the `.env` file inside your local folder.

### Set up MongoDB

- install [MongoDB Community Edition](https://www.mongodb.com/try/download/community)
- start your local database; if you followed the default setup on Windows, run:
```shell
"C:\Program Files\MongoDB\Server\7.0\bin\mongod.exe" --dbpath="C:\data\db"
```
- connect to the database, e.g. via Compass or the [VS Code extension](https://www.mongodb.com/products/tools/vs-code)
- once connected, create the database `zhaw_matchmaking`
  - create two collections, `persons` and `profile_data`
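
If you prefer to create the collections from Python instead of a GUI, the step above can be sketched as follows. This assumes `pymongo` is installed in your virtual environment and that `mongod` listens on the MongoDB default `mongodb://localhost:27017`; the helper names `ensure_collections` and `connect_local` are hypothetical and do not exist in the repository.

```python
def ensure_collections(db, names=("persons", "profile_data")):
    """Create any of the given collections that do not exist yet.

    `db` is expected to behave like a pymongo `Database`.
    Returns the names of the collections that were created.
    """
    existing = set(db.list_collection_names())
    created = []
    for name in names:
        if name not in existing:
            db.create_collection(name)
            created.append(name)
    return created


def connect_local(uri="mongodb://localhost:27017", db_name="zhaw_matchmaking"):
    """Return the database handle of a local mongod (URI is an assumption)."""
    # Imported lazily so that ensure_collections has no hard dependency.
    from pymongo import MongoClient
    return MongoClient(uri)[db_name]


# Usage, with mongod running:
#   ensure_collections(connect_local())
```

Running it a second time creates nothing, so it is safe to repeat.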

### Scrape the profile data and fill your MongoDB with it

- assuming you are already running this inside your virtual environment with the dependencies installed, run the notebook inside the `ingest` folder:
    -  `Build Profile Database.ipynb` (simply run all cells in order from start to finish, **scraping will take around 2h**)

### Set up Pinecone

- register a free tier account on [pinecone.io](https://www.pinecone.io/) and create a project with any name
- create an **index** called `zhaw-matchmaking` with `1536` dimensions and `cosine` metric
- create a Pinecone **API key** and copy it to the `.env` file inside your local folder
    - make sure the other environment variables are set correctly; if you followed along, the Pinecone environment and index should match these:
        ```
        PINECONE_ENVIRONMENT="gcp-starter"
        PINECONE_INDEX="zhaw-matchmaking"
        ```
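
To catch a missing or empty variable before running the ingest step, a small check along these lines may help. It is only a sketch: `PINECONE_ENVIRONMENT` and `PINECONE_INDEX` come from this guide, whereas the key names `OPENAI_API_KEY` and `PINECONE_API_KEY` and the helper `missing_env_vars` are assumptions, so adjust them to whatever your `.env` actually uses. It inspects the process environment, so the `.env` file has to be loaded first (for example with `python-dotenv`, assuming the project uses it).

```python
import os

# PINECONE_ENVIRONMENT and PINECONE_INDEX appear in this guide; the two key
# names are assumptions, so adjust them to match your `.env`.
REQUIRED_VARS = (
    "OPENAI_API_KEY",
    "PINECONE_API_KEY",
    "PINECONE_ENVIRONMENT",
    "PINECONE_INDEX",
)


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]
```

An empty list from `missing_env_vars()` means all four variables are present.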
 
### Fill the Pinecone vector database with the profile data and metadata from your MongoDB

- with all environment variables set and your MongoDB still up and running, run the cell below
- the script runs a pipeline that reads the raw profile data and the user data (used as metadata) from your MongoDB collections, creates preprocessed chunks, embeds them using OpenAI embeddings, and stores them in your Pinecone index
- optional: to follow the steps taken in the script, there is another notebook in the [background material](https://github.com/przvlprd/zhaw-matchmaking-material/blob/main/2.%20Build%20Vector%20Database.ipynb) which works with `ChromaDB` instead of Pinecone

In [None]:
# Run this cell or the script manually from your shell / IDE
%run ingest/ingest.py
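
For orientation, the chunk-and-embed part of such a pipeline can be sketched roughly like this. It is not the code from `ingest/ingest.py`: the chunk size and overlap, the `"text"` key of the profile document, the selection of metadata fields, and the helper names `chunk_text` and `build_records` are all assumptions made for illustration. The `embed` argument stands in for the embedding call, so the sketch runs without any API key.

```python
def chunk_text(text, size=1000, overlap=200):
    """Split text into overlapping character windows (sizes are illustrative)."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    if not text:
        return []
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    stop = max(len(text) - overlap, 1)
    return [text[i:i + size] for i in range(0, stop, step)]


def build_records(profile, person, embed, size=1000, overlap=200):
    """Turn one profile into (id, vector, metadata) tuples.

    `profile` is assumed to hold the raw text under the key "text";
    `person` follows the record format of `combined.json`;
    `embed` is any callable mapping a string to a list of floats.
    """
    records = []
    for n, chunk in enumerate(chunk_text(profile["text"], size, overlap)):
        metadata = {"name": person["name"], "email": person["email"], "text": chunk}
        record_id = f'{person["shorthandSymbol"]}-{n}'
        records.append((record_id, embed(chunk), metadata))
    return records
```

Each tuple pairs an ID and a vector with the metadata of the person it belongs to, which is what allows a search hit on a chunk to be traced back to a profile.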

### Run it locally...
- now with the vector database set up, you can either run the Panel app locally:

In [None]:
!panel serve app.py --autoreload --show

- or run the server for the REST API locally:

In [None]:
%run server.py

### ...or deploy the code
- e.g. using `Replit`:
    - register / log in to Replit
    - click on `+ Create Repl` and `Import from Github`
    - paste the repo URL:
      ```
      https://github.com/przvlprd/zhaw-matchmaking-app
      ```
    - specify the startup script, e.g. like this:
      ```
      panel serve app.py --allow-websocket-origin=*
      ```
- Known errors & workarounds
    - although Replit attempts to use the `requirements.txt` for setup, I had to manually install some modules again using the shell and `pip install <module>`
    - if there is an error due to a failed local import, copy the content of `ingest/preprocess_profile_data.py` into `chain.py`
      and delete the following line inside `chain.py`:
      ```
      from ingest.preprocess_profile_data import preprocess_profile
      ```
      (this workaround is not needed when running locally)

### Feel free to reach out in case something is broken!