GitHub - guptarohit994/ECE143_group25_project: FA19 course final project

Directory Structure

crawling_and_parsing
- Contains all parsers and crawlers
- geckodriver
data
- Contains data crawled from CAPE, UCOP
- Also contains merged dataset that we created
statistical_analysis
- Contains modules and notebooks used for analyzing data
- Graphs.ipynb shows the figures that made it to the presentation
ECE_143_Group_25_Presentation.pdf
- The final set of slides
- Also contains some backup slides which were hard to let go :(

Data Crawling

UCOP

cd crawling_and_parsing
python ucop_crawler.py --output_dir . --location='Berkeley'

Sample output (the remaining-page counter counts down to 0) -

year:2018, location:Berkeley, headless:True, geckodriver_path:./geckodriver, output_dir:.
=========================================================================================
Firefox Initialized in headless mode
Need to go through 589 more pages!!!
=========================================================================================
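To crawl more than one campus, the command above can be repeated once per location. The following is only a minimal sketch, assuming it is run from inside crawling_and_parsing; build_command and crawl_all are hypothetical helpers that are not part of this repository, and only flags listed in this README are used. By default it does a dry run and just prints the commands.

```python
import subprocess
import sys


def build_command(location, year=2018, output_dir="."):
    """Build the argument list for one ucop_crawler.py run (hypothetical helper)."""
    return [
        sys.executable, "ucop_crawler.py",
        "--output_dir", output_dir,
        "--location", location,
        "--year", str(year),
    ]


def crawl_all(locations, dry_run=True):
    """Run (or, in a dry run, only print) one crawl per location."""
    commands = [build_command(location) for location in locations]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the loop if a crawl exits with an error
            subprocess.run(cmd, check=True)
    return commands


# The two locations named in this README; other valid names are not listed here.
crawl_all(["San Diego", "Berkeley"])
```

Passing dry_run=False would actually launch the crawler, so Firefox and geckodriver need to be set up first.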

Usage -

usage: ucop_crawler.py [-h] [--location [LOCATION]] [--year [YEAR]]
                       [--headless [HEADLESS]]
                       [--geckodriver_path [GECKODRIVER_PATH]]
                       [--output_dir [OUTPUT_DIR]]

Crawls information from UCOP Annual Wages webpage

optional arguments:
  -h, --help            show this help message and exit
  --location [LOCATION]
                        location whose wages to crawl, default:'San Diego'
  --year [YEAR]         year for which data to crawl, default:2018
  --headless [HEADLESS]
                        mode in which to launch browser, default:True
  --geckodriver_path [GECKODRIVER_PATH]
                        path to the geckodriver executable,
                        default:'./geckodriver'
  --output_dir [OUTPUT_DIR]
                        dir where to place the generated csv,
                        default:'../data/csv/ucop/'
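
The bracketed form [--location [LOCATION]] in the usage text is what argparse prints for an option declared with nargs='?'. The following is a minimal sketch of a parser that would accept the same options; it is not the code in ucop_crawler.py, and str2bool and build_parser are hypothetical names introduced for illustration.

```python
import argparse


def str2bool(value):
    """Hypothetical helper: turn 'True'/'False' style strings into a bool."""
    if isinstance(value, bool):
        return value
    lowered = value.lower()
    if lowered in ("true", "t", "yes", "1"):
        return True
    if lowered in ("false", "f", "no", "0"):
        return False
    raise argparse.ArgumentTypeError("expected a boolean, got {!r}".format(value))


def build_parser():
    """Declare the options listed in the usage text, with the same defaults."""
    parser = argparse.ArgumentParser(
        prog="ucop_crawler.py",
        description="Crawls information from UCOP Annual Wages webpage",
    )
    parser.add_argument("--location", nargs="?", default="San Diego",
                        help="location whose wages to crawl")
    parser.add_argument("--year", nargs="?", type=int, default=2018,
                        help="year for which data to crawl")
    parser.add_argument("--headless", nargs="?", type=str2bool, default=True,
                        help="mode in which to launch browser")
    parser.add_argument("--geckodriver_path", nargs="?", default="./geckodriver",
                        help="path to the geckodriver executable")
    parser.add_argument("--output_dir", nargs="?", default="../data/csv/ucop/",
                        help="dir where to place the generated csv")
    return parser


# Parse the same arguments as the example command above.
args = build_parser().parse_args(["--output_dir", ".", "--location=Berkeley"])
print(args.location, args.year, args.headless, args.output_dir)
```

A plain type=bool would treat any non-empty string, including 'False', as True, which is why the sketch converts the string explicitly.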

CAPE

The CAPE database is locked behind Single Sign-On (SSO). This crawler therefore requires an SSO username and password to be inserted in ./crawling_and_parsing/ucsd_triton_sso.py to initiate a login request, and the DUO push must be accepted for the login to succeed.

cd crawling_and_parsing
python cape_crawler.py

Sample output (e.g. a quick crawl of the CSE department only, placing the generated csv in the directory given by --output_dir) -

python cape_crawler.py --output_dir . --department 'CSE'

department:CSE, detailed:False, headless:True, geckodriver_path:./geckodriver, output_dir:.
=========================================================================================
Firefox Initialized in headless mode
*********** Initiating a SSO login ***********
*********** Successfully logged in ***********

Successfully parsed departments(1/1),        CSE results(2087 courses)        
=========================================================================================

Usage -

usage: cape_crawler.py [-h] [--department DEPARTMENT] [--detailed DETAILED]
                       [--headless HEADLESS]
                       [--geckodriver_path [GECKODRIVER_PATH]]
                       [--output_dir [OUTPUT_DIR]]

Crawls information for all courses in various departments on CAPE webpage

optional arguments:
  -h, --help            show this help message and exit
  --department DEPARTMENT
                        specific department whose data to crawl. Either 1 or
                        all, default:'all'
  --detailed DETAILED   form of data scraping. Detailed are much slower.
                        default:'False'
  --headless HEADLESS   mode in which to launch browser, default:'True'
  --geckodriver_path [GECKODRIVER_PATH]
                        path to the geckodriver executable,
                        default:'./geckodriver'
  --output_dir [OUTPUT_DIR]
                        dir where to place the generated csv,
                        default:'../data/csv/cape/'

Third Party Modules

The third-party modules we used are listed in requirements.txt. To install these Python dependencies -

pip install -r requirements.txt

Although geckodriver is included in this repository, it needs Firefox Quantum (v63.0+) to work, and Firefox must be installed manually.
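
Before starting a long crawl, it can help to confirm that both pieces are in place. The following is a minimal sketch only: check_environment is a hypothetical helper that is not part of this repository, and it assumes the Firefox executable is named firefox and is on PATH, which may not hold on every platform. It does not check the Firefox version.

```python
import os
import shutil


def check_environment(geckodriver_path="./geckodriver"):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    # geckodriver must exist at the given path and be executable
    if not (os.path.isfile(geckodriver_path)
            and os.access(geckodriver_path, os.X_OK)):
        problems.append("geckodriver not found or not executable at "
                        + geckodriver_path)
    # Assumption: the browser binary is called 'firefox' and is on PATH
    if shutil.which("firefox") is None:
        problems.append("firefox executable not found on PATH")
    return problems


for problem in check_environment():
    print("WARNING:", problem)
```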
