Scrape and save data

mine-cetinkaya-rundel · Jul 6, 2019 · fcd775a
1 parent e816f1b
Showing 7 changed files with 401 additions and 0 deletions.
diff --git a/data-scrape.R b/data-scrape.R
@@ -0,0 +1,50 @@
+# load packages ----------------------------------------------------------------
+library(rvest)
+library(tidyverse)
+library(glue)
+library(tools)
+
+# read schedule page -----------------------------------------------------------
+page <- read_html("http://www.user2019.fr/talk_schedule/")
+
+# extract table ----------------------------------------------------------------
+tabs <- page %>%
+  html_table(header = TRUE)
+
+# function to process data -----------------------------------------------------
+process_schedule <- function(day_tab, day_name){
+
+  # remove unused columns ----
+  raw <- day_tab %>% select(-2, -Slides)
+
+  # create talks_long ----
+  talks_long <- raw %>%
+    slice(seq(1, nrow(raw), by = 2)) %>%
+    mutate(info = as.character(glue("{Title} <br><br> _{Speaker}_")))
+
+  # create talks_wide ----
+  talks_wide <- talks_long %>%
+    select(Time, Room, info) %>%
+    pivot_wider(names_from = Room, values_from = info) %>%
+    select(Time, `Concorde 1+2`, `Cassiopée`, `Caravelle 2`, 
+           `Saint-Exupéry`, `Ariane 1+2`, `Guillaumet 1+2`)
+
+  # create abstracts_long ----
+  abstracts_long <- raw %>%
+    slice(seq(2, nrow(raw), by = 2)) %>%
+    rename(Abstract = Time) %>%
+    select(Abstract) %>%
+    bind_cols(talks_long, .) %>%
+    mutate(Day = toTitleCase(day_name)) %>%
+    select(Day, Time, Title, Speaker, Abstract, Session, Room, Chair)
+
+  # write out ----
+  write_csv(talks_wide, glue("data/{day_name}_talks_wide.csv"))
+  write_csv(abstracts_long, glue("data/{day_name}_abstracts_long.csv"))
+
+}
+
+# process days -----------------------------------------------------------------
+process_schedule(tabs[[1]], "wed")
+process_schedule(tabs[[2]], "thu")
+process_schedule(tabs[[3]], "fri")
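
The reshaping in `process_schedule()` relies on each talk taking up two consecutive rows of the scraped table: an odd row with the talk metadata and an even row whose `Time` cell holds the abstract. A minimal sketch of that pattern on a toy table may make it easier to follow. The `raw` tibble and all its values are hypothetical stand-ins, not the real schedule, and `paste0()` is used here in place of `glue()` only to keep the example short.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for one scraped day table: two talks, two rows each.
# Row 1 and 3 hold metadata; row 2 and 4 hold the abstract in the Time column.
raw <- tibble(
  Time    = c("10:25", "Abstract of talk A", "10:25", "Abstract of talk B"),
  Room    = c("Room 1", "", "Room 2", ""),
  Title   = c("Talk A", "", "Talk B", ""),
  Speaker = c("Ann", "", "Bob", "")
)

# Odd rows: talk metadata, plus a combined title/speaker cell for display
talks_long <- raw %>%
  slice(seq(1, nrow(raw), by = 2)) %>%
  mutate(info = paste0(Title, " <br><br> _", Speaker, "_"))

# Even rows: keep only the abstract text, which sits in the Time column
abstracts <- raw %>%
  slice(seq(2, nrow(raw), by = 2)) %>%
  transmute(Abstract = Time)

# Pair each talk with its abstract by row position
abstracts_long <- bind_cols(talks_long, abstracts)

# One row per time slot, one column per room
talks_wide <- talks_long %>%
  select(Time, Room, info) %>%
  pivot_wider(names_from = Room, values_from = info)
```

Because `bind_cols()` matches rows purely by position, this approach assumes the scraped table always alternates metadata and abstract rows with none missing; a talk without an abstract row would shift every later pairing.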
diff --git a/data/fri_abstracts_long.csv b/data/fri_abstracts_long.csv
diff --git a/data/fri_talks_wide.csv b/data/fri_talks_wide.csv
@@ -0,0 +1,15 @@
+Time,Concorde 1+2,Cassiopée,Caravelle 2,Saint-Exupéry,Ariane 1+2,Guillaumet 1+2
+09:15,Tools for Model-Based Clustering in R <br><br> _Bettina Grün_,NA,NA,NA,NA,NA
+10:25,Native Chrome Automation using R <br><br> _Christophe DervieuxRomain Lesur_,fxtract - Feature Extraction from Grouped Data <br><br> _Quay Au_,Adjusting reviewer scores for a fairer assessment via multi-faceted Rasch modelling <br><br> _Caterina Constantinescu_,The transition from conventional tools in banking to R <br><br> _Balazi Peter_,rGSAn: a R package dedicated to the gene set analysis using semantic similarity measures. <br><br> _Aarón Ayllón-BenítezPatricia Thebault_,NA
+10:30,Our journey with Shiny : some packages to enhance your applications <br><br> _Victor PerrierFanny Meyer_,Spatial Optimisation with OSRM and R <br><br> _Megan Beckett_,Penalized regressions to study multivariate linear models : the VariSel package. <br><br> _Marie Perrot-Dockès_,"R++, a new Graphical User Interface for R <br><br> _Christophe Genolini_",Pathway-VisualiseR: An Interactive Web Application for Visualising Gene Networks <br><br> _Goknur GinerAlexandra Garnham_,NA
+10:35,auth0: Secure Authentication in Shiny with Auth0 <br><br> _Julio Trecenti_,Anomaly detection in trivago <br><br> _Peter Brejcak_,"Maximum spacing estimation, a new method in fitdistrplus <br><br> _Christophe Dutang_",R in Pharma: A tailored approach to converting programmers to R in an industry resistant to change <br><br> _Kieran Martin_,Compiling a global database of sapflow measurements with R: Workflow and tools for the SAPFLUXNET database <br><br> _Víctor Granda_,NA
+10:40,Packaging shiny applications <br><br> _Maxim Nazarov_,Using R and the Tidyverse to Play Fantasy Baseball <br><br> _Angeline Protacio_,rama: an R interface to the GAMA agent-based modeling platform <br><br> _Marc Choisy_,Community Driven Data Science in Insurance <br><br> _Kevin Kuo_,Bayesian sequential integration within a preclinical PK/PD modeling framework using rstan package: Lessons learned <br><br> _Fabiola La Gamba_,NA
+10:45,Photon : Building an electron-shiny app using a simple RStudio add in. <br><br> _Abbas Rizvi_,Optimizing children sleeping time using regression and machine learning <br><br> _Alicja Fras_,RcppGreedySetCover: Scalable Set Cover <br><br> _Kaeding Matthias_,unconfUROS and one of its outputs vornoiTreemap <br><br> _Alexander Kowarik_,VICI: a Shiny app for accurate estimation of Vaccine Induced Cellular Immunogenicity with bivariate modeling <br><br> _Boris Hejblum_,NA
+10:50,Visualizing Huge Amounts of Fleet Data using Shiny and Leaflet <br><br> _Andreas Wittmann_,NA,The GPareto and GPGame packages for multi and many objective Bayesian optimization <br><br> _Mickaël Binois_,An R implementation of a model-based estimator  –  a UK case study <br><br> _Konstantinos Soulanis_,Tools for 3D/4D interactive visualisation of cells and biological tissue <br><br> _Marion Louveaux_,NA
+10:55,NA,NA,NA,Using advanced R packages for the visualisation of clinical data in a cancer hospital setting <br><br> _Roxane Legaie_,Analysis of laboratory test requests in a university hospital: A Shiny App for association analysis as a demand management tool <br><br> _Deniz Topcu_,NA
+11:30,How to win friends and write an open-source book <br><br> _Jakub NowosadRobin Lovelace_,Machine Learning Infrastructure at Netflix <br><br> _Savin Goyal_,prVis: a Novel Method for Visual Dimension Reduction <br><br> _Norman MatloffTiffany JiangWenxuan ZhaoRobert Tucker_,pak: a fresh approach to package installation <br><br> _Gábor Csárdi_,"timeseriesdb - Manage, Process and Archive Time Series with R and PostgreSQL <br><br> _Matthias Bannert_",Implementation and analysis design of an adaptive-outcome trial in R <br><br> _Alessio Crippa_
+11:48,Making sense of CRAN: Package and collaboration networks <br><br> _Ioannis Kosmidis_,Deploying machine learning models at scale <br><br> _Angus Taylor_,PLS for Big Data: A Unified Parallel Algorithm for Regularized Group PLS <br><br> _Benoit Liquet_,Summary of developments in R's data.table package <br><br> _Arun Srinivasan_,A feast of time series tools <br><br> _Rob Hyndman_,Advances in dose-response analysis <br><br> _Christian RitzJens C. Streibig_
+12:06,RWsearch: a package for CRAN users and task view maintainers <br><br> _Patrice Kiener_,Serverless Computing for R <br><br> _Christoph BodnerThomas Laber_,multiDA and genDA: Discriminant analysis methods for large scale and complex datasets <br><br> _Sarah Romanes_,Real-time file import with the vroom package <br><br> _Jim Hester_,tsbox: Class-Agnostic Time Series <br><br> _Christoph Sax_,The next generation of the survival package <br><br> _Terry Therneau_
+12:24,"Translating datasets using ""datalang"": the development of ""datos"" package for the R4DS Spanish translation <br><br> _Riva Quiroga_",A DevOps process for deploying R to production <br><br> _David Smith_,compboost: Fast and Flexible Component-Wise Boosting Framework <br><br> _Daniel Schalk_,A Future for R: Simplified Parallel and Distributed Processing <br><br> _Henrik Bengtsson_,RJDemetra: an R interface to JDemetra+ seasonal adjustment software <br><br> _Alain Quartier-La-Tente_,A flexible approach to time-to-event data analysis using case-base sampling <br><br> _Jesse Islam_
+12:42,R Consortium Working Groups <br><br> _Joseph Rickert_,Authentication and authorization in plumber with the sealr package <br><br> _Friedrike Preu_,How to speed-up VSURF (Variable Selection Using Random Forests)? <br><br> _Robin Genuer_,FastRCluster: running FastR from GNU-R <br><br> _Stepan Sindelar_,Experiences from dealing with missing values in sensor time series data <br><br> _Steffen Moritz_,The R package mixmeta: an extended mixed-effects framework for meta-analysis <br><br> _Antonio Gasparrini_
+14:15,'AI for Good' in the R and Python ecosystems <br><br> _Julien Cornebise_,NA,NA,NA,NA,NA