Data Scraping all years #3

Open
lopezpedres opened this issue Sep 19, 2021 · 3 comments

@lopezpedres
Owner

lopezpedres commented Sep 19, 2021

The last number in each link is the school ID. There are 4 schools: 15, 25, 35, and 45.
Here is the link for 2021:
https://www.dgae.unam.mx/Licenciatura2021/resultados/15.html
Here is the link for 2017:
https://www.dgae.unam.mx/Febrero2017/resultados/35.html
Here is the link for 2016:
https://servicios.dgae.unam.mx/Febrero2016/resultados/35.html
Here is the link for 2015:
https://servicios.dgae.unam.mx/Febrero2015/resultados/15.html
Here is the link for 2014:
https://servicios.dgae.unam.mx/Febrero2014/resultados/15.html
Here is the link for 2013:
https://servicios.dgae.unam.mx/Febrero2013/resultados/25.html
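A minimal sketch of how the links above could be generated per year and school ID. The host and term for each year are copied from the links in this comment; years not listed here (for example 2018 to 2020) are left out rather than guessed. It is an assumption that every school ID has a page for every listed year, since only one link per year is shown. The names `TERMS`, `SCHOOL_IDS`, `result_url`, and `all_result_urls` are hypothetical.

```python
# Host and term path for each year, taken from the links listed above.
TERMS = {
    2021: ("www.dgae.unam.mx", "Licenciatura2021"),
    2017: ("www.dgae.unam.mx", "Febrero2017"),
    2016: ("servicios.dgae.unam.mx", "Febrero2016"),
    2015: ("servicios.dgae.unam.mx", "Febrero2015"),
    2014: ("servicios.dgae.unam.mx", "Febrero2014"),
    2013: ("servicios.dgae.unam.mx", "Febrero2013"),
}

# The four school IDs named in the comment.
SCHOOL_IDS = (15, 25, 35, 45)


def result_url(year, school_id):
    """Build the results URL for one year and one school ID."""
    host, term = TERMS[year]
    return f"https://{host}/{term}/resultados/{school_id}.html"


def all_result_urls():
    """Every year/school combination (assumed, not verified, to exist)."""
    return [result_url(y, s) for y in sorted(TERMS) for s in SCHOOL_IDS]


if __name__ == "__main__":
    for url in all_result_urls():
        print(url)
```

Keeping the host next to the term in one table makes the switch between the `www` and `servicios` subdomains explicit for each year.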

@mate-h
Contributor

mate-h commented Sep 20, 2021

Data sources from 1996 to present, HTML format:

https://web.archive.org/web/19970329111029/http://www.dgae.unam.mx/
https://www.dgae.unam.mx/admision/
https://web.archive.org/web/*/http://www.dgae.unam.mx*

Archival statistical agenda data source from 1959 to present, PDF format:
http://agendas.planeacion.unam.mx/

@mate-h
Contributor

mate-h commented Sep 20, 2021

I collected all the available links from the Web Archive here:
https://storage.googleapis.com/mate-h.appspot.com/archive-links.csv

@mate-h
Contributor

mate-h commented Sep 21, 2021

We need to gather all of the working links. Example link:

https://www.dgae.unam.mx/Licenciatura2021/resultados/1/10400035.html

Schematic link:

https://www.dgae.unam.mx/:term/resultados/:areaId/:facultyId:buildingId.html
https://servicios.dgae.unam.mx/:term/resultados/:areaId/:facultyId:buildingId.html
  1. Scrape all of the identifiers for:
  • Terms
  • Areas
  • Faculties
  • Buildings
  2. Construct all of the possible URLs using the IDs and the link schema above. The link schema version depends on the term.
  3. Download them one by one using lynx and discard the documents that were not resolved. Take all of the tags into account (the special link in 1996, the Archive links, the servicios subdomain, etc.).
  4. Write scripts to process that data.
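Steps 2 and 3 could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the identifier lists are placeholders until step 1 is done, and splitting `10400035` from the example link into faculty `104` and building `00035` is a guess about where the two IDs meet. The download step shells out to `lynx -source`, which prints the raw HTML of a page; treating a nonzero exit status or empty output as "not resolved" is also an assumption.

```python
import subprocess
from itertools import product


def candidate_urls(hosts_by_term, area_ids, faculty_ids, building_ids):
    """Yield every URL allowed by the schema for the given identifiers.

    hosts_by_term maps a term (e.g. "Licenciatura2021") to the host that
    serves it, since the subdomain depends on the term.
    """
    combos = product(hosts_by_term.items(), area_ids, faculty_ids, building_ids)
    for (term, host), area, faculty, building in combos:
        yield f"https://{host}/{term}/resultados/{area}/{faculty}{building}.html"


def lynx_command(url):
    """Command line that asks lynx for the raw HTML of one page."""
    return ["lynx", "-source", url]


def fetch(url, timeout=30):
    """Return the page HTML, or None if the page did not resolve.

    Assumption: failure shows up as a nonzero exit status or empty output.
    An error page returned with status 0 would need a separate check.
    """
    done = subprocess.run(
        lynx_command(url), capture_output=True, text=True, timeout=timeout
    )
    if done.returncode != 0 or not done.stdout.strip():
        return None
    return done.stdout


if __name__ == "__main__":
    # Placeholder identifiers; the 104 / 00035 split is a guess.
    hosts = {"Licenciatura2021": "www.dgae.unam.mx"}
    for url in candidate_urls(hosts, ["1"], ["104"], ["00035"]):
        print(url, "ok" if fetch(url) else "not resolved")
```

Generating the URLs lazily keeps memory flat even if the cross product of identifiers is large, and most of the candidates are expected to be discarded anyway.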
