# CRAN Scraping Overview

The base URL listing all CRAN packages can be found [here](https://cran.r-project.org/web/packages/available_packages_by_name.html). A few general notes:
1. CRAN requests that a canonical link be used when referencing individual package pages. Abiding by their request, I use it as the base link when referencing a package page. The canonical URL is:
* https://cran.r-project.org/package=&lt;package_name&gt;
The literal URL is (just for reference, in case it ever needs to be changed):
* https://cran.r-project.org/web/packages/&lt;package_name&gt;/index.html
2. The process first goes to the base URL:
* https://cran.r-project.org/web/packages/available_packages_by_name.html
From there it finds the summary table, goes through each &lt;a&gt; tag, and grabs the href value (the link).
3. For each href, it visits the package page and scrapes the necessary fields (title, abstract, etc.).
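
The two URL forms described in note 1 can be built with a pair of small helpers. This is a minimal sketch using only the standard library; the names `CRAN_ROOT`, `canonical_url`, and `literal_url` are hypothetical and not part of the notebook, and percent-encoding the package name is a precaution I am assuming is harmless, since CRAN package names consist of letters, digits, and dots.

```python
from urllib.parse import quote

# Hypothetical constant and helper names, introduced only for this sketch.
CRAN_ROOT = "https://cran.r-project.org"


def canonical_url(package_name):
    """Return the canonical package link that CRAN asks referrers to use."""
    return f"{CRAN_ROOT}/package={quote(package_name, safe='')}"


def literal_url(package_name):
    """Return the literal index page, kept for reference only."""
    return f"{CRAN_ROOT}/web/packages/{quote(package_name, safe='')}/index.html"


print(canonical_url("A3"))
print(literal_url("A3"))
```

Keeping both forms in one place means that only one function has to change if the literal layout ever has to be used instead of the canonical link.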

In [1]:
from bs4 import BeautifulSoup
import urllib.request
import re

# base_url = "https://cran.r-project.org/"
# canonical base URL for a given package
base_url = "https://cran.r-project.org/package="
# lists all packages
home_url = "https://cran.r-project.org/web/packages/available_packages_by_name.html"
# make request to webpage
response = urllib.request.urlopen(home_url)
r = response.read()
# make the soup (parse the page)
soup = BeautifulSoup(r, 'html.parser')

In [4]:
# go to the table with the specified summary field
# then find all <a> tags - one for each row
package_table = soup.find('table', summary="Available CRAN packages by name.").find_all('a')
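
Note that `soup.find(...)` returns `None` when no table carries that exact `summary` attribute, in which case the chained `.find_all('a')` raises an `AttributeError`. As a rough, dependency-free alternative (this swaps BeautifulSoup for the standard library's `html.parser`, it is not what the notebook itself does), the sketch below collects the same links and simply returns an empty list when the table is missing. The class name `PackageLinkParser` and the sample HTML are made up for illustration, and nested tables are not handled.

```python
from html.parser import HTMLParser


class PackageLinkParser(HTMLParser):
    """Collect (link text, href) pairs for <a> tags inside one summary table."""

    def __init__(self, summary):
        super().__init__()
        self.summary = summary    # value of the table's summary attribute
        self.in_table = False     # True while inside the matching table
        self.current_href = None  # href of the <a> tag currently open
        self.text_parts = []      # text pieces seen inside that <a> tag
        self.links = []           # collected (text, href) pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table" and attrs.get("summary") == self.summary:
            self.in_table = True
        elif tag == "a" and self.in_table and attrs.get("href"):
            self.current_href = attrs["href"]
            self.text_parts = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            text = "".join(self.text_parts).strip()
            self.links.append((text, self.current_href))
            self.current_href = None
        elif tag == "table":
            self.in_table = False


# Made-up sample that mimics the layout assumed for the CRAN listing page.
sample = """
<a href="skip.html">not in the table</a>
<table summary="Available CRAN packages by name.">
<tr><td><a href="../../web/packages/A3/index.html">A3</a></td><td>Error metrics</td></tr>
<tr><td><a href="../../web/packages/abc/index.html">abc</a></td><td>Tools for ABC</td></tr>
</table>
"""

parser = PackageLinkParser("Available CRAN packages by name.")
parser.feed(sample)
parser.close()
print(parser.links)
```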


In [2]:
# This is just a note in case the literal link needs to be used;
# the canonical link doesn't require this level of processing
################################################################

# base_url = "https://cran.r-project.org/"
# for packages in package_table:
#     print(packages.get_text())
#     print(re.sub(r'\.\./\.\./', base_url, packages.get('href')))
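
If the literal links are ever needed, the relative hrefs can also be resolved without a regular expression. The sketch below uses `urllib.parse.urljoin` from the standard library; it assumes the hrefs look like `../../web/packages/<name>/index.html`, which is what the substitution above implies, and the helper name `resolve_href` is hypothetical.

```python
from urllib.parse import urljoin

home_url = "https://cran.r-project.org/web/packages/available_packages_by_name.html"


def resolve_href(href, page_url=home_url):
    """Resolve a relative href against the URL of the page it was found on."""
    return urljoin(page_url, href)


print(resolve_href("../../web/packages/A3/index.html"))
```

Resolving against the page URL keeps working if the number of `../` segments in the hrefs changes, whereas the regular expression would need to be updated.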

In [5]:
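# for each <a> tag, build the canonical URL from its text (the package name),
# then fetch and parse the page so the needed contents can be scraped
# for the sake of this example I'm only taking the first package;
# just remove [0:1] below to go through all packages
# (soup is overwritten on every iteration, so the scraping in the next cell
# would then have to move inside the loop)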

for package in package_table[0:1]:
    package_url = base_url + package.get_text()
    print(package_url)
    response = urllib.request.urlopen(package_url)
    r = response.read()
    soup = BeautifulSoup(r, 'html.parser')

https://cran.r-project.org/package=A3
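
When the `[0:1]` slice is removed, the loop issues one request per package, so it is worth pausing between requests and surviving individual failures. The following is only a sketch under assumptions: the one-second default delay, the 30-second timeout, and the names `fetch_page`, `fetch_packages`, and `fake_fetch` are my own choices, not values taken from CRAN or from the notebook. The demonstration at the end uses a stand-in fetch function so that it runs without network access.

```python
import time
import urllib.request

base_url = "https://cran.r-project.org/package="


def fetch_page(url, timeout=30):
    """Download one page and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_packages(names, fetch=fetch_page, delay=1.0):
    """Fetch each package page, pausing between requests and recording failures."""
    pages, failures = {}, {}
    for index, name in enumerate(names):
        if index > 0 and delay > 0:
            time.sleep(delay)  # pause between consecutive requests
        try:
            pages[name] = fetch(base_url + name)
        except OSError as exc:  # URLError, HTTPError and socket timeouts derive from OSError
            failures[name] = str(exc)
    return pages, failures


# Offline demonstration with a stand-in fetch function (hypothetical, no network access).
def fake_fetch(url):
    if url.endswith("=missing"):
        raise OSError("not found")
    return b"<html>" + url.encode() + b"</html>"


pages, failures = fetch_packages(["A3", "missing"], fetch=fake_fetch, delay=0)
print(sorted(pages), failures)
```

Passing the fetch function as a parameter keeps the loop testable and leaves the real `urllib` call in one place.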


Each package is written out as one tab-separated record with the following 13 fields:

1. Full Article Title
2. Type of Entry: A - article, D - disambiguation page (list of articles), R - redirect
3. Alias - (only applies to redirects)
4. Empty field (but requires a placeholder)
5. Categories - an article can belong to multiple categories
6. Empty field (requires a placeholder)
7. Related Topics - list of links to be displayed
8. Empty field
9. External Links
10. Content of disambiguation page (only applies to disambiguation pages)
11. Image - link to image URL
12. Abstract - contains all content you wish to display
13. URL - source domain
 

In [6]:
article_title  = soup.find('body').find_all('h2')[0].get_text()
entry_type     = 'A'
category       = 'R Package' # not sure about this category part
related_topics = '' # unsure about this since each package is different - would require manual knowledge of each package
external_links = ''  # vignettes? docs? (can be scraped off the page)
disambig       = ''
image_url      = ''
abstract       = soup.find('body').find_all('p')[0].get_text()  # site content
source         = package_url

# may need to replace the line return in the article title and/or escape it
# (%r prints the repr, so the newline shows up as a literal \n below)
# fields 3 (alias) and 4 (placeholder) are both empty, hence three tabs after entry_type
print('%r\t%r\t\t\t%r\t\t%r\t\t%r\t%r\t%r\t%r\t%r' %
      (article_title, entry_type, category, related_topics,
       external_links, disambig, image_url, abstract, source))


'A3: Accurate, Adaptable, and Accessible Error Metrics for Predictive\nModels'	'A'			'R Package'		''		''	''	''	'Supplies tools for tabulating and analyzing the results of predictive models. The methods employed are applicable to virtually any predictive model and make comparisons between different methodologies straightforward.'	'https://cran.r-project.org/package=A3'
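
Counting tabs by hand in a format string is error-prone, so one option is to list the 13 fields explicitly and join them. This is a sketch, not the notebook's method: the names `build_record` and `clean` are hypothetical, and collapsing newlines and tabs into single spaces is an assumption about how the title's line break should be handled (escaping it might be required instead).

```python
import re

FIELD_COUNT = 13  # number of fields in the record layout listed above


def clean(value):
    """Collapse runs of whitespace (newlines, tabs) so a value cannot break the row."""
    return re.sub(r"\s+", " ", value).strip()


def build_record(title, abstract, source, entry_type="A", category="R Package",
                 related_topics="", external_links="", disambig="", image_url=""):
    """Return one tab-separated record with every field in its numbered position."""
    fields = [
        clean(title),     # 1. full article title
        entry_type,       # 2. type of entry
        "",               # 3. alias (redirects only)
        "",               # 4. empty placeholder
        category,         # 5. categories
        "",               # 6. empty placeholder
        related_topics,   # 7. related topics
        "",               # 8. empty placeholder
        external_links,   # 9. external links
        disambig,         # 10. disambiguation content
        image_url,        # 11. image
        clean(abstract),  # 12. abstract
        source,           # 13. URL
    ]
    assert len(fields) == FIELD_COUNT
    return "\t".join(fields)


record = build_record(
    "A3: Accurate, Adaptable, and Accessible Error Metrics for Predictive\nModels",
    "Supplies tools for tabulating and analyzing the results of predictive models.",
    "https://cran.r-project.org/package=A3",
)
print(record.split("\t"))
```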


In [7]:
# just in case you want to see what 'soup' looks like: it's just the HTML of the page
soup

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>CRAN - Package A3</title>
<link href="../../CRAN_web.css" rel="stylesheet" type="text/css"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<style type="text/css">
  table td { vertical-align: top; }
</style>
</head>
<body>
<h2>A3: Accurate, Adaptable, and Accessible Error Metrics for Predictive
Models</h2>
<p>Supplies tools for tabulating and analyzing the results of predictive models. The methods employed are applicable to virtually any predictive model and make comparisons between different methodologies straightforward.</p>
<table summary="Package A3 summary">
<tr>
<td>Version:</td>
<td>1.0.0</td>
</tr>
<tr>
<td>Depends:</td>
<td>R (≥ 2.15.0), <a href="../xtable/index.html">xtable</a>, <a href="../pbapply/index.html">pbapply</a></td>
</tr>
<tr>
<td>Suggests:</td>
<td><a href="../randomForest/in