A deep learning project for classifying cities from street view images using multiple model architectures. This project compares three approaches to visual place recognition: a GSV-based model adapted from an existing open-source implementation, a vision-based transformer model, and our hybrid model, GeoSceneNet, which fuses computer-vision-based scene descriptors with CNN image features.
Based on the GSV-Cities framework:
- Backbone: ResNet50
- Aggregation: ConvAP (Convolutional Aggregation Pooling)
- Fine-tuned with a classification head for city prediction
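As a rough illustration of the aggregation step listed above, the sketch below follows ConvAP as we understand it from the GSV-Cities framework: a 1x1 convolution mixes channels, average pooling shrinks each channel to a small `s1 x s2` grid, and the flattened result is L2-normalized. It is written in plain Python on nested lists so it runs without PyTorch. The function names, the tiny 3-channel input, and the weights are made up for illustration and are not taken from this repository.

```python
import math


def conv1x1(fmap, weights, bias):
    """1x1 convolution: mix channels independently at every pixel.

    fmap is [C_in][H][W], weights is [C_out][C_in], bias is [C_out].
    """
    c_in, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [
        [
            [bias[o] + sum(weights[o][c] * fmap[c][y][x] for c in range(c_in))
             for x in range(w)]
            for y in range(h)
        ]
        for o in range(len(weights))
    ]


def avg_pool_to(fmap, s1, s2):
    """Average-pool every channel to an s1 x s2 grid.

    Simplification: H and W must be divisible by s1 and s2.
    """
    h, w = len(fmap[0]), len(fmap[0][0])
    if h % s1 or w % s2:
        raise ValueError("this sketch needs H % s1 == 0 and W % s2 == 0")
    bh, bw = h // s1, w // s2
    pooled = []
    for channel in fmap:
        grid = []
        for i in range(s1):
            row = []
            for j in range(s2):
                block = [channel[y][x]
                         for y in range(i * bh, (i + 1) * bh)
                         for x in range(j * bw, (j + 1) * bw)]
                row.append(sum(block) / len(block))
            grid.append(row)
        pooled.append(grid)
    return pooled


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]


def convap(fmap, weights, bias, s1=2, s2=2):
    """Channel mixing, spatial pooling, flatten, then L2 normalization."""
    reduced = conv1x1(fmap, weights, bias)
    pooled = avg_pool_to(reduced, s1, s2)
    flat = [v for channel in pooled for row in channel for v in row]
    return l2_normalize(flat)


# Toy input: 3 channels of size 4x4, reduced to 2 channels (illustrative values).
fmap = [[[float(c + y + x) for x in range(4)] for y in range(4)] for c in range(3)]
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
bias = [0.0, 0.0]
descriptor = convap(fmap, weights, bias)
print(len(descriptor))  # 8 = 2 output channels * 2 * 2 grid cells
```

In PyTorch the same three steps would typically be expressed with `nn.Conv2d(..., kernel_size=1)`, `nn.AdaptiveAvgPool2d`, and `torch.nn.functional.normalize`.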
Uses OpenAI's CLIP model:
- Base Model: openai/clip-vit-base-patch32
- Architecture: CLIP vision encoder with a linear classification head
- Leverages pre-trained vision-language representations
Our own custom model:
- Model: Fusion of CV scene descriptors and CNN image features (ResNet18)
- A classification head predicts the city from the fused features
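To make the fusion idea concrete, here is a minimal sketch of one way scene descriptors and CNN features could be combined before classification. The README does not state how GeoSceneNet fuses the two, so plain concatenation is an assumption, and every name, weight, and city label below is hypothetical. Plain Python is used so the example runs without any deep learning library.

```python
import math


def fuse(cnn_features, scene_descriptors):
    """Fuse two feature vectors by concatenation (assumed, not confirmed)."""
    return list(cnn_features) + list(scene_descriptors)


def linear(x, weights, bias):
    """Linear classification head: one row of weights per class."""
    return [b + sum(w_i * x_i for w_i, x_i in zip(row, x))
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_city(cnn_features, scene_descriptors, weights, bias, cities):
    """Return the most likely city and the full probability vector."""
    fused = fuse(cnn_features, scene_descriptors)
    probs = softmax(linear(fused, weights, bias))
    best = max(range(len(cities)), key=probs.__getitem__)
    return cities[best], probs


# Toy example: 2 CNN features, 1 scene descriptor, 2 placeholder classes.
cities = ["city_a", "city_b"]
weights = [[1.0, 0.0, 0.0],   # logit for city_a
           [0.0, 1.0, 1.0]]   # logit for city_b
bias = [0.0, 0.0]
city, probs = predict_city([0.2, 0.8], [1.0], weights, bias, cities)
print(city)  # city_b (logits are 0.2 and 1.8)
```

In a real model the CNN vector would be much larger (a standard ResNet18 produces 512 features before its final layer), and the head's weights would be learned during training rather than set by hand.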
- Clone the repository:
git clone https://github.com/onoderamia/CS549.git
cd CS549
- Prepare the GSV-cities repository:
git submodule init
git submodule update
Then comment out line 7 in gsv/gsv-cities/main.py:
# from dataloaders.GSVCitiesDataloader import GSVCitiesDataModule
- Install dependencies:
pip install -r requirements.txt
- Scrape your own data OR download our data from here
cd scraper
echo "API_KEY=[YOUR_GOOGLE_API_KEY]" > .env
python scraper.py
Train the GSV model:
cd gsv
python train.py # Train from scratch
python train.py ../models/gsv.pth # Resume from checkpoint
Train the VLM model:
cd vlm
python train.py # Train from scratch
python train.py ../models/vlm.pth # Resume from checkpoint
Train the custom model:
cd custom
python train.py
All training scripts save the model to the models/ directory. You can also use our pretrained models available here.
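The training commands above take an optional checkpoint path: without it training starts from scratch, with it training resumes. How the repository's `train.py` scripts actually parse their arguments is not shown here, so the following is only a sketch of that command-line pattern using the standard `argparse` module. The helper names `parse_args` and `starting_point` are hypothetical.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Accept an optional positional checkpoint path."""
    parser = argparse.ArgumentParser(description="Training entry point (sketch).")
    parser.add_argument(
        "checkpoint",
        nargs="?",          # zero or one value, so the argument may be omitted
        type=Path,
        default=None,
        help="checkpoint to resume from; omit to train from scratch",
    )
    return parser.parse_args(argv)


def starting_point(args):
    """Decide between a fresh run and a resumed run."""
    if args.checkpoint is None:
        return "scratch"
    if not args.checkpoint.is_file():
        raise FileNotFoundError(f"checkpoint not found: {args.checkpoint}")
    return "resume"


# Equivalent to `python train.py` with no arguments.
fresh = parse_args([])
print(fresh.checkpoint, starting_point(fresh))

# Equivalent to `python train.py ../models/gsv.pth`.
resumed = parse_args(["../models/gsv.pth"])
print(resumed.checkpoint)
```

Checking that the checkpoint file exists before any model is built gives an early, readable error when the path is mistyped.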
Run the web server:
cd webapp
python app.py
A demonstration of the web application is available here. If you would like to test it on the same GeoGuessr map, you can also try it here.
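The setup step that comments out line 7 of gsv/gsv-cities/main.py can also be scripted so it is repeatable. The helper below is a hypothetical convenience, not part of the repository. It only edits the line if it contains the expected import, and it does nothing if the line is already commented out. The demo runs against a temporary stand-in file so it works without the submodule checked out.

```python
import tempfile
from pathlib import Path


def comment_out_line(path, line_number, expected):
    """Comment out a 1-indexed line if it contains `expected`.

    Returns True if the file was changed, False if the line was already
    commented out. Raises if the line does not look like the expected one.
    """
    path = Path(path)
    lines = path.read_text(encoding="utf-8").splitlines(keepends=True)
    if not 1 <= line_number <= len(lines):
        raise IndexError(f"{path} has no line {line_number}")
    line = lines[line_number - 1]
    if expected not in line:
        raise ValueError(f"line {line_number} does not contain {expected!r}")
    if line.lstrip().startswith("#"):
        return False  # already commented out, nothing to do
    lines[line_number - 1] = "# " + line
    path.write_text("".join(lines), encoding="utf-8")
    return True


# Demo on a stand-in file whose 7th line is the import named in the setup step.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "main.py"
    filler = "".join(f"x{i} = {i}\n" for i in range(1, 7))
    demo.write_text(
        filler + "from dataloaders.GSVCitiesDataloader import GSVCitiesDataModule\n",
        encoding="utf-8",
    )
    first = comment_out_line(demo, 7, "GSVCitiesDataModule")
    second = comment_out_line(demo, 7, "GSVCitiesDataModule")
    patched = demo.read_text(encoding="utf-8").splitlines()[6]

print(first, second)  # True False
print(patched)
```

From the repository root, the equivalent call would be `comment_out_line("gsv/gsv-cities/main.py", 7, "GSVCitiesDataModule")`.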