# 1. Scraping Tabular Data 

In [4]:
# load rvest (run install.packages("rvest") first if it is not already installed)
library(rvest)

In [8]:
# specify the URL of the page you want to scrape
url <- "https://www.the-numbers.com/movie/budgets/all"

In [9]:
# read the HTML content of the page
html_content <- read_html(url)

In [10]:
# use the html_nodes() function to extract specific elements, such as divs or tables, using CSS selectors
data <- html_content %>% 
  html_nodes("table") %>% 
  html_table()

In [11]:
# html_table() on a set of nodes returns a list with one data frame per table
# you can inspect the result using the head() function; data[[1]] is the first table
head(data)

| | ReleaseDate | Movie | ProductionBudget | DomesticGross | WorldwideGross |
|---|---|---|---|---|---|
| `<int>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| 1 | Dec 9, 2022 | Avatar: The Way of Water | $460,000,000 | $641,726,731 | $2,175,722,357 |
| 2 | Apr 23, 2019 | Avengers: Endgame | $400,000,000 | $858,373,000 | $2,794,731,755 |
| 3 | May 20, 2011 | Pirates of the Caribbean: On Stranger Tides | $379,000,000 | $241,071,802 | $1,045,713,802 |
| 4 | Apr 22, 2015 | Avengers: Age of Ultron | $365,000,000 | $459,005,868 | $1,395,316,979 |
| 5 | May 17, 2023 | Fast X | $340,000,000 | $0 | $0 |
| 6 | Dec 16, 2015 | Star Wars Ep. VII: The Force Awakens | $306,000,000 | $936,662,225 | $2,064,615,817 |
| 7 | Apr 25, 2018 | Avengers: Infinity War | $300,000,000 | $678,815,482 | $2,048,359,754 |
| 8 | May 24, 2007 | Pirates of the Caribbean: At World’s End | $300,000,000 | $309,420,425 | $960,996,492 |
| 9 | Nov 13, 2017 | Justice League | $300,000,000 | $229,024,295 | $655,945,209 |
| 10 | Oct 6, 2015 | Spectre | $300,000,000 | $200,074,175 | $879,077,344 |


In this example, the html_nodes() function is called with the CSS selector "table", which matches every table element in the HTML content of the page. The html_table() function then converts each matched table into an R data frame and returns them together as a list, so a single table is accessed with data[[1]].
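
To make the list behaviour concrete, here is a minimal sketch that runs offline. The two-table HTML snippet and its ids ("budgets", "other") are made up for illustration and are not taken from the page scraped above.

```r
library(rvest)

# Hypothetical page with two tables, built inline so no network access is needed
page <- read_html('<html><body>
  <table id="budgets">
    <tr><th>Movie</th><th>ProductionBudget</th></tr>
    <tr><td>Spectre</td><td>$300,000,000</td></tr>
  </table>
  <table id="other">
    <tr><th>x</th></tr>
    <tr><td>1</td></tr>
  </table>
</body></html>')

# html_nodes("table") matches both tables; html_table() converts each one
tables <- page %>%
  html_nodes("table") %>%
  html_table()

class(tables)    # a plain list, one element per matched table
length(tables)   # number of tables found on the page

# Pull out a single table by position before working with it
budgets <- tables[[1]]
head(budgets)
```

Indexing with [[1]] assumes the table you want is the first one on the page, which is worth checking whenever a page contains more than one table.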

Note that you may need to modify the CSS selector depending on the structure of the table on the webpage you're trying to scrape. You can inspect the HTML source code of the page to determine the correct CSS selector to use.
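
As a rough illustration of a more specific selector, the sketch below picks one table by id and then converts the currency strings to numbers. The HTML snippet, the "nav" class, and the "budgets" id are assumptions invented for this example; the real page may be structured differently. html_element() requires rvest 1.0.0 or later (html_node() is the older equivalent).

```r
library(rvest)

# Hypothetical page: a navigation table followed by the table we actually want
page <- read_html('<html><body>
  <table class="nav"><tr><td>Home</td></tr></table>
  <table id="budgets">
    <tr><th>Movie</th><th>ProductionBudget</th></tr>
    <tr><td>Avengers: Endgame</td><td>$400,000,000</td></tr>
    <tr><td>Spectre</td><td>$300,000,000</td></tr>
  </table>
</body></html>')

# "table#budgets" matches only the table whose id is "budgets".
# html_element() returns a single node, so html_table() gives one data frame
# rather than a list.
budgets <- page %>%
  html_element("table#budgets") %>%
  html_table()

# The budget column is read as text; strip "$" and "," to get a numeric column
budgets$ProductionBudget <- as.numeric(gsub("[$,]", "", budgets$ProductionBudget))

str(budgets)
```

The same gsub() step could be applied to the DomesticGross and WorldwideGross columns shown in the output above, since they are also read as character strings.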

---

# 2. Scraping Non-Tabular Data

In [5]:
# Scrape the webpage
webpage <- read_html("https://archive.org/details/internetarchivebooks")

In [14]:
# Extract the text data
text_data <- html_text(html_nodes(webpage, "p"))

In [15]:
# Store the text data in a data frame
data_frame <- data.frame(text = text_data)

In [16]:
# the extracted data will be stored as a data frame
# you can inspect the data frame using the head() function
head(data_frame)

| | text |
|---|---|
| | `<chr>` |
| 1 | Due to a planned power outage on Friday, 1/14, between 8am-1pm PST, some services may be impacted. |
| 2 | Search the history of over 780 billion  web pages  on the Internet. |
| 3 | Capture a web page as it appears now for use as a trusted citation in the future. |
| 4 | Please enter a valid web address |
| 5 | Books contributed by the Internet Archive. |
| 6 | Total Views 203,325,781 (Older Stats) |


---

In [2]:
library(dplyr)


Attaching package: 'dplyr'


The following objects are masked from 'package:stats':

    filter, lag


The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union




In [6]:
# Use html_nodes to extract specific elements from the HTML
header_text <- html_nodes(webpage, "h1") %>% html_text()

In [7]:
# Print the extracted text
print(header_text)

  [1] "Internet Archive Books"                                                                                                                     
  [2] "Share This Collection"                                                                                                                      
  [3] "\n        \n          Filters\n        \n        \n          3,752,202          \n            RESULTS          \n        \n        \n      "
  [4] "Media TypeMedia Type"                                                                                                                       
  [5] "YearYear"                                                                                                                                   
  [6] "Topics & SubjectsTopics & Subjects"                                                                                                         
  [7] "CollectionCollection"                                                                                    

In [8]:
# Convert the extracted text into a data frame
header_df <- data.frame(header = header_text, stringsAsFactors = FALSE)

In [9]:
# Print the data frame
print(header_df)

                                                                                                                                         header
1                                                                                                                        Internet Archive Books
2                                                                                                                         Share This Collection
3   \n        \n          Filters\n        \n        \n          3,752,202          \n            RESULTS          \n        \n        \n      
4                                                                                                                          Media TypeMedia Type
5                                                                                                                                      YearYear
6                                                                                                            Topics & SubjectsTopics & S
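
The third header above comes back padded with newlines and runs of spaces. One possible cleanup, sketched here on a made-up HTML snippet rather than the live page, is to collapse the whitespace before building the data frame. The names raw_text and clean_text are illustrative.

```r
library(rvest)

# Hypothetical snippet imitating a header whose text is spread over several lines
page <- read_html('<html><body>
  <h1>Internet Archive Books</h1>
  <h1>
      Filters
      3,752,202
        RESULTS
  </h1>
</body></html>')

raw_text <- page %>% html_nodes("h1") %>% html_text()

# Collapse every run of whitespace (including newlines) into a single space,
# then trim the leading and trailing space
clean_text <- trimws(gsub("\\s+", " ", raw_text))

header_df <- data.frame(header = clean_text, stringsAsFactors = FALSE)
print(header_df)
```

html_text(trim = TRUE) removes only leading and trailing whitespace, so the gsub() step is what handles the gaps inside the string. In rvest 1.0.0 or later, html_text2() is another option that normalises whitespace in a browser-like way.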

---

# 3. Data Sources We Can Consider

https://archive.org/ **<- CHECK THIS OUT**

https://data.world/arcadeanalytics/best-500-albums-amazon-neptune

https://en.wikipedia.org/wiki/List_of_online_music_databases

---