leskeg-zz/tfm

IT Master Thesis

Project name: "Web Crawling and data analysis with Python"

Description: This project extracts data from a tourism website using Scrapy. The output is two files: 'result.json' and 'region.json'. These files are then imported into a MongoDB database called 'tfm'. Finally, several grammatical analyses and a clustering step are run on the imported data.

Technologies:

  • Scrapy Framework
  • MongoDB
  • Scikit-Learn

Dependencies:

  • python-virtualenv
  • mongodb
  • mongodb-tools
  • robomongo
  • lapack
  • blas
  • gcc-fortran

Setup:

  • virtualenv -p /usr/bin/python2.7 env
  • source env/bin/activate
  • pip install -r req.txt
  • In a Python console: import nltk; nltk.download('stopwords'); nltk.download('punkt')
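
The two NLTK downloads above provide a stopword list and the 'punkt' tokenizer. As a rough illustration of the kind of preprocessing they enable, here is a minimal sketch that uses a regular expression and a tiny hard-coded stopword set in place of NLTK, so it runs without any download. The names STOPWORDS, content_words and term_counts are hypothetical and are not taken from the project code.

```python
import re
from collections import Counter

# Tiny illustrative stopword set (an assumption for this sketch);
# the project downloads NLTK's full stopword list instead.
STOPWORDS = {"a", "an", "and", "the", "in", "of", "for", "with", "to"}


def content_words(text, stopwords=STOPWORDS):
    """Lowercase the text, split it into alphabetic tokens, drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in stopwords]


def term_counts(text, stopwords=STOPWORDS):
    """Count how often each remaining token occurs."""
    return Counter(content_words(text, stopwords))
```

With NLTK installed, the regular expression could be swapped for the NLTK tokenizer and the hard-coded set for the downloaded stopword list.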

Deployment:

cd groupon_parser

rm result.json; scrapy crawl grouponScrapy -o result.json -t json

rm region.json; scrapy crawl grouponRegionScrapy -o region.json -t json
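
The rm before each crawl matters because, in the Scrapy versions that run on Python 2.7, -o appends to an existing file, and two JSON arrays appended one after the other are no longer valid JSON. A quick check of the export before importing it could look like the sketch below. The function name summarize_export is hypothetical; the 'url' field is the one used as the upsert key in the import step.

```python
import json


def summarize_export(path, key="url"):
    """Return (item_count, missing_key_count) for a Scrapy JSON export.

    Raises ValueError if the file is not a single, well-formed JSON array.
    """
    with open(path) as handle:
        data = json.load(handle)  # malformed JSON raises ValueError
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of items")
    # Count items that are not objects or that lack a non-empty key field.
    missing = sum(
        1 for item in data
        if not isinstance(item, dict) or not item.get(key)
    )
    return len(data), missing
```

Calling summarize_export('result.json') after the crawl returns the number of scraped items and how many of them lack a 'url' value.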

Import the results into the database:

mongoimport --db tfm --collection result --type json --file result.json --jsonArray --upsertFields url

python update_regions_db.py result.json
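
The --upsertFields url option in the mongoimport command above means that documents are matched on their 'url' field, so re-importing after a new crawl updates existing entries instead of duplicating them. The sketch below mimics that behaviour on plain Python lists, assuming that an incoming document replaces a stored one with the same 'url'. It does not talk to MongoDB, and upsert_by_field is a hypothetical helper, not part of the project.

```python
def upsert_by_field(existing, incoming, field="url"):
    """Merge incoming documents into existing ones, keyed on `field`.

    A document whose key is already present replaces the stored one;
    a document with a new key is appended. First-seen order is kept.
    """
    by_key = {}
    order = []
    for doc in list(existing) + list(incoming):
        key = doc[field]
        if key not in by_key:
            order.append(key)
        by_key[key] = doc  # a later document wins over an earlier one
    return [by_key[key] for key in order]
```

For example, merging a stored offer with url 'a' and a freshly crawled offer with the same url yields a single entry holding the fresh data.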

Analysis:

  • cd ../clustering/
  • python clus.py

Results are shown in the console, and clustering.html is created.
