Craigslove: Love in the Time of Craigslist

http://phucanhapril.github.io/craigslove/

craigslist web scraper

python craigslove_scraper.py <base url>

where <base url> is a Craigslist city subdomain such as newyork.craigslist.org or sfbay.craigslist.org.

This will scrape personals from Strictly Platonic (stp), Miscellaneous Romance (msr), w4w, w4m, m4w, and m4m.

Posts are saved to posts/<city>/<query>.csv (for example: posts/providence/w4w.csv)
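A minimal sketch of how a base url and a category code could map to a search URL and an output file. The helper names and the `https://<base url>/search/<code>` layout are assumptions for illustration; they are not taken from craigslove_scraper.py.

```python
import os

# Category codes listed above.
CATEGORIES = ["stp", "msr", "w4w", "w4m", "m4w", "m4m"]


def city_from_base_url(base_url):
    """Return the subdomain of a base url, e.g. 'newyork' (hypothetical helper)."""
    host = base_url.split("://")[-1]  # drop an optional scheme
    host = host.split("/")[0]         # drop any trailing path
    return host.split(".")[0]


def search_url(base_url, category):
    """Build a search URL for one category (the URL layout is an assumption)."""
    host = base_url.split("://")[-1].strip("/")
    return "https://{0}/search/{1}".format(host, category)


def output_path(base_url, category, root="posts"):
    """Return posts/<city>/<query>.csv for a base url and category."""
    return os.path.join(root, city_from_base_url(base_url), category + ".csv")


if __name__ == "__main__":
    for code in CATEGORIES:
        print(search_url("providence.craigslist.org", code),
              output_path("providence.craigslist.org", code))
```

Here the city directory is simply the subdomain, so sfbay.craigslist.org would write under posts/sfbay/.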

setup (mac instructions)

install Python 2.7 and pip

sudo pip install BeautifulSoup4  
brew install libxml2
pip install lxml (if that fails, run 'xcode-select --install' first, then retry)  
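To confirm the install worked, a quick check like the following can help. It is only a sketch: `check_dependencies` is a hypothetical helper, not part of this repo. BeautifulSoup4 is imported under the module name `bs4`.

```python
def check_dependencies(modules=("bs4", "lxml")):
    """Return {module name: True/False} depending on whether it imports."""
    status = {}
    for name in modules:
        try:
            __import__(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status


if __name__ == "__main__":
    for name, ok in sorted(check_dependencies().items()):
        print("{0}: {1}".format(name, "ok" if ok else "missing"))
```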

k-means clustering

ml/clustering/cluster.py

instructions for running within the ml/clustering directory:

python cluster.py -path data/city/sample -c 2 -plot True

This will perform k-means clustering with k = 2 (the -c value) on each csv file in data/city/sample.
The resulting clusters are saved in json format to cluster_data/.
Because -plot is True, it will also create 2-dimensional scatter plots of the clusters, saved to cluster_plot/.
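For readers unfamiliar with k-means, here is a minimal standard-library sketch of the algorithm (Lloyd's iteration) on 2-D points, with the result serialized to json. It is an illustration only: cluster.py works on post text and its feature extraction and json layout are not shown here, so the numeric input, the function names, and the output format below are all assumptions.

```python
import json
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = float(len(points))
    return [sum(coords) / n for coords in zip(*points)]


def kmeans(points, k, iterations=100, seed=0):
    """Return (assignments, centroids) after Lloyd's iteration."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    assignments = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so we have converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean(members)
    return assignments, centroids


if __name__ == "__main__":
    sample = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
    labels, centers = kmeans(sample, k=2)
    # Hypothetical json layout: cluster index -> list of member points.
    clusters = {str(c): [p for p, a in zip(sample, labels) if a == c]
                for c in range(2)}
    print(json.dumps(clusters, indent=2))
```

The initial centroids are drawn at random, so the cluster numbering can differ between seeds even when the grouping is the same.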

note: ml/clustering/data/ is populated from the original raw posts/ by running ml/clustering/util/format_posts_for_clustering.py.