In my last two posts ([here](/building-a-datacamp-archive/) and [here](/webscraping-datacamp-courses-ii/)), I discussed my efforts to scrape the [DataCamp](http://datacamp.com/) website so that I could have a record of the courses I've taken. In this final post, I'm putting all the pieces together, discussing how the script works and showing what the final product looks like. 

There's a lot of code here, so I've frontloaded it into the first half of the post. You can choose your own adventure. If you're interested in how the script works, go to the next section. If you're interested in seeing what the script does, skip to the section "In Action."

Before I dive in, I want to say how excited I am about this project. A year ago, I could barely print `"Hello world"`, but thanks to sites like [DataCamp](http://datacamp.com/), communities like [The Carpentries](https://carpentries.org) and a heavy dose of [Stack Overflow](https://stackoverflow.com), I'm doing things like this. So if you're interested in coding or data science but think it's something you could never do, think again. Keep a growth mindset, and dive in.

# How It Works

Here's how the script works. The user passes the link of the course they want to `get_whole_course()`. This function then uses `get_course_outline()` to get a list of the chapters and lessons in the course, `make_chapter_notes()` to create a text file for each chapter, and `download_chapter_slides()` to download a PDF of each chapter's slides. It then creates a text file with a table of contents for the course. This allows for easy navigation of the course once all the text files are in my notes system (I use [nvALT](http://brettterpstra.com/projects/nvalt/) and [The Archive](http://zettelkasten.de)). Everything is organized using unique, 14-digit "z numbers," which are assigned to each course and chapter.
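
As a rough illustration of the z-number idea, here is a minimal sketch that uses only the standard library. The function name `sketch_z_numbers` and the choice to read the clock once are assumptions made for this example; the script's own `make_z_number()` uses pandas and a slightly different timestamp format.

```python
from datetime import datetime

def sketch_z_numbers(count, now=None):
    """Return `count` unique 14-digit ID strings: a 12-digit timestamp plus a 2-digit counter."""
    if not 0 < count < 100:
        raise ValueError('count must be between 1 and 99')
    now = now or datetime.now()
    stamp = now.strftime('%Y%m%d%H%M')  # year, month, day, hour, minute: 12 digits
    return [stamp + '{:02d}'.format(i) for i in range(count)]

# A fixed timestamp makes the result reproducible.
ids = sketch_z_numbers(3, now=datetime(2018, 8, 15, 17, 7))
print(ids)  # ['20180815170700', '20180815170701', '20180815170702']
```

Because the counter is appended to a single timestamp, the IDs within one call cannot collide, which is what lets each chapter file and the table of contents link to one another.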

Here are the libraries I used, with a brief explanation of the role each one plays: 

In [92]:
from bs4 import BeautifulSoup					#For parsing the HTML
from collections import OrderedDict				#For storing chapter and lesson info
from subprocess import Popen, PIPE, STDOUT		#For accessing pandoc
from urllib.request import urlopen, urlretrieve	#For fetching the webpage, downloading PDF of slides
import json										#For dealing with website data that comes in JSON
import pandas as pd								#For creating Zettelkasten-style filenames
import re										#For regex parsing of `sct`
import subprocess								#For accessing pandoc
import pprint									#For troubleshooting, looking through webpages, dictionaries, etc

And here's all the code you need to capture an entire course:

In [2]:
def get_whole_course(link, return_z=False):
    '''Receives course URL from user. Gets ordered dict of chapters/lessons. 
    Creates list of unique z_numbers, one for each chapter. 
    Creates txt file for each chapter, fills each file with lesson content.
    Gets course name. Creates table of contents.
    Feeds to: get_course_outline(), make_z_number(), make_chapter_notes(), 
    get_course_title(), create_table_of_contents(), download_chapter_slides()'''
    course_dictionary = get_course_outline(link)
    z_list = make_z_number(len(course_dictionary.items())+1)
    chapter_names_Z_list = []
    for chapter, lesson in course_dictionary.items():
        z_index = list(course_dictionary.keys()).index(chapter)
        filename = z_list[z_index] + ' ' + chapter + '.txt'
        chapter_names_Z_list.append(filename)
        chapter_link = lesson[0][1]  # link of the chapter's first lesson; any lesson link works
        make_chapter_notes(filename, chapter_link)
    course_name = get_course_title(link)
    create_table_of_contents(course_dictionary, course_name, z_list)
    first_key = list(course_dictionary.keys())[0]
    one_link = course_dictionary[first_key][0][1]
    download_chapter_slides(chapter_names_Z_list, one_link)
    if return_z:
        return z_list[-1]
    
## Should get chapter PDF here. Not in make_chapter_notes(). Should make file name in the PDF function. Maybe make a l

def get_course_outline(link): # Still works
    '''Receives link to course landing page from get_whole_course(). 
    Returns ordered dict of chapters with lessons.'''
    html = urlopen(link)
    soup = BeautifulSoup(html, 'lxml')
    lesson_outline = soup.find_all(['h4', 'li'])
    chapters = OrderedDict()   # {chapter: [(lesson_name, lesson_link), ...], ...}
    for item in lesson_outline:
        attributes = item.attrs
        try:
            class_type = attributes['class'][0]
            if class_type == 'chapter__title':
                chapter = item.text.strip()
                chapters[chapter] = []
            if class_type == 'chapter__exercise':
                lesson_name = item.find('h5').text
                lesson_link = item.find('a').attrs['href']
                chapters[chapter].append((lesson_name, lesson_link))
        except KeyError:
            pass
    return chapters

def make_z_number(num):
    '''Takes int and returns a list of unique, 14-digit numbers. Only goes to 99.'''
    assert num < 100, 'Enter an int that is less than 100. Max list size is 99.'
    string_list = []
    z_index = 0
    for x in range(num):
        z_string = pd.to_datetime('now').strftime('%Y%m%d%H%S') #Did seconds instead of minutes to avoid repeats
        z_string = z_string + '{0:0>2}'.format(z_index)
        string_list.append(z_string)
        z_index += 1
    return string_list

def make_chapter_notes(filename, link):
    '''Receives filename and lesson link from get_whole_course().
    (Note that a link from any lesson in a chapter will work. 
    That is, any lesson link has all the information for the chapter.)
    Cycles through all lessons in chapter, converting each lesson and sub-exercise from HTML to Markdown.
    Prints all chapter content into text file.
    Feeds to: get_lesson_json(), NormalExercise_print(), BulletExercise_print(), 
    MultipleChoiceExercise_print(), download_chapter_slides()'''
    lesson_json = get_lesson_json(link)
    for item in lesson_json['exercises']['all']:
        if item['type'] == 'VideoExercise':
            pass
        elif item['type'] == 'NormalExercise':
            NormalExercise_print(item, filename)
        elif item['type'] == 'BulletExercise':
            BulletExercise_print(item, filename)
        elif item['type'] == 'MultipleChoiceExercise':
            MultipleChoiceExercise_print(item, filename)

def get_course_title(link): # still works
    html = urlopen(link)
    soup = BeautifulSoup(html, 'lxml')
    return soup.title.text

def create_table_of_contents(dictionary, course_name, z_list):
    '''Receives course dictionary, course name, and list of unique z_numbers from get_whole_course().
    Creates text file with contents of course, formatted in Markdown, with wiki-style links to each chapter.'''    
    filename = z_list[-1] + ' ' + course_name + '.txt'
    with open(filename, 'a') as f:
        for chapter, lessons in dictionary.items():
            z_index = list(dictionary.keys()).index(chapter)
            print('\n# ', '[[' + z_list[z_index] + ']]', chapter, '\n', file=f)
            for lesson_name, lesson_link in lessons:
                print("   *", lesson_name, file=f)
                
def NormalExercise_print(json, f):
    '''Works with make_chapter_notes. Parses NormalExercise type lessons and prints them in 
    markdown to file.
    Feeds to: convert_2_md(), get_success_msg().'''
    with open(f, 'a') as f:
        print('#', json['title'], '\n', file=f)
        print('## Exercise\n', file=f)
        print(convert_2_md(json['assignment']), file=f)
        print('## Instructions\n', file=f)
        print(convert_2_md(json['instructions'][:-2]), file=f)
        print('## Code\n', file=f)
        print('```\n' + convert_2_md(json['sample_code']).replace('\\', ''),'\n```\n', file=f)
        print('```\n' + convert_2_md(json['solution']).replace('\\', ''),'\n```\n', file=f)
        print(get_success_msg(json['sct']) + '\n', file=f)

def BulletExercise_print(json, f):
    '''Works with make_chapter_notes. Parses BulletExercises type lessons and prints them in 
    markdown to file.
    Feeds to: convert_2_md(), get_success_msg().'''
    with open(f, 'a') as f:
        print('# ' + json['title'], '\n', file=f)
        print('## Exercise\n', file=f)
        print(convert_2_md(json['assignment']), file=f)
        print('## Instructions & Code \n', file=f)  
        for item in json['subexercises']:
            print(convert_2_md(item['instructions']), file=f)
            print('```\n' + item['sample_code'] + '\n```\n', file=f)
            print('```\n' + item['solution'] + '\n```', file=f)
            print(get_success_msg(item['sct']) + '\n', file=f)

def MultipleChoiceExercise_print(json, f):
    '''Works with make_chapter_notes. Parses MultipleChoice type lessons and prints them in 
    markdown to file.
    Feeds to: convert_2_md(), get_correct_mc(), get_success_msg().'''
    with open(f, 'a') as f:
        print('# ' + json['title'], '\n', file=f)
        print('## Exercise\n', file=f)
        print(convert_2_md(json['assignment']), file=f)
        print("## Choices\n", file=f)
        for choice in json['instructions']:
            print('* ' + choice, file=f)
        print('\n**Correct answer: ' + get_correct_mc(json['sct']) + '**\n', file=f)
        print(get_success_msg(json['sct']) + '\n', file=f)

def convert_2_md(string): #Source: http://www.practicallyefficient.com/2016/12/04/pandoc-and-python.html
    '''Receives a string of HTML and use Pandoc to return string in Markdown.'''
    p = Popen(['pandoc', '-f', 'html', '-t', 'markdown', '--wrap=preserve'], stdout=PIPE, stdin=PIPE, stderr=STDOUT)
    text = p.communicate(input=string.encode('utf-8'))[0]
    text = unescape(text.decode('utf-8'))
    return text

def get_success_msg(string):
    '''Parses text from DataCamp `sct` JSON and returns the success message as a string.'''
    match = re.search(r'success_msg\("(.*?)"\)', string)
    if match is not None:
        message = match.group(1)
        return message
    else:
        return ''

def get_correct_mc(string):
    '''Parses text from DataCamp `sct` JSON and correct answer for MultipleChoice type lessons. 
    Works with MultipleChoiceExercise_print'''
    match = re.search(r'test_mc\(correct = (\d),', string)
    if match is not None:
        message = match.group(1)
        return message
    else:
        return ''

def download_chapter_slides(chapter_names, one_link):
    '''Receives the list of chapter_names and a link to the first lesson of the first chapter
    from get_whole_course(). Gets links for the PDF slides for each chapter, downloads each PDF,
    and saves it with the same name as the chapter txt file for easy indexing.'''
    course_json = get_lesson_json(one_link)
    pdf_links = []
    for item in course_json['course']['chapters']:
        pdf_links.append(item['slides_link'])
    pdf_tuples = list(zip(chapter_names, pdf_links))
    for t in pdf_tuples:
        filename = t[0][:-len('.txt')] + '.pdf'  # slice off the suffix; str.strip('.txt') strips characters, not the suffix
        if t[1]:
            urlretrieve(t[1], filename)

def unescape(s): #Source: https://wiki.python.org/moin/EscapingHtml
    '''Receives string from convert_2_md() and unescapes the HTML entities &lt;, &gt; and &amp;.'''
    s = s.replace("&lt;", "<")
    s = s.replace("&gt;", ">")
    s = s.replace("&amp;", "&")
    return s

Once the above code is loaded, all you have to do is grab the link for the course and run `get_whole_course()`.
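
Before running it, it may help to see what the two `sct` helpers pull out of the lesson data. The sketch below applies the same regular expressions to a made-up `sct` string; the string's contents are an assumption for illustration, not copied from DataCamp.

```python
import re

# A made-up `sct` string for illustration; real ones come from DataCamp's lesson data.
sample_sct = 'test_mc(correct = 2, msgs = ["No", "Yes"])\nsuccess_msg("Nice work!")'

# Same patterns as get_success_msg() and get_correct_mc().
success = re.search(r'success_msg\("(.*?)"\)', sample_sct)
correct = re.search(r'test_mc\(correct = (\d),', sample_sct)

print(success.group(1))  # Nice work!
print(correct.group(1))  # 2
```

Note that `(\d)` captures a single digit, so this pattern only covers questions with fewer than ten choices.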

And, as a bonus, you can use the below function to scrape a whole track. In the next section, I demo what it looks like to scrape the track that I'm currently working on.

In [3]:
def get_whole_track(link):
    html = urlopen(link)
    soup = BeautifulSoup(html, 'lxml')
    track = soup.find_all('a', attrs={'class':'course-block__link ds-snowplow-link-course-block'})
    track_title = soup.title.text
    courses = []

    for x in track:
        title = x.find('h4').text
        tail = x.attrs['href']
        url = 'https://www.datacamp.com' + tail
        courses.append((title, url))
    
    track_file = '20180815170711 ' + track_title + '.txt'  
    
    with open(track_file, 'a') as f:
        print('#', track_title + '\n', file=f)
        for course in courses:
            z_num = get_whole_course(course[1], return_z=True)
            course_name = '[[' + z_num + ']] ' + course[0]
            print('*', course_name, file=f)

In [19]:
link = 'https://www.datacamp.com/tracks/data-scientist-with-python'
get_whole_track(link)
print('Done!')

TypeError: list indices must be integers or slices, not str

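Something is broken because they changed the site: the data that `get_lesson_json()` returns is now a list rather than a dictionary, so lookups by string key fail with the `TypeError` above.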

In [None]:
get_lesson_json(link)

The solution is [this](https://stackoverflow.com/questions/52894092/web-scraping-unknown-data-structure-json-nested-list-or-something-else):

> If you're only interested in parsing THIS page, the issue is with double quotes that are escaped. Removing them allows you to load the string as json and access all the lists and inner lists. Executing `json_text = json_text.replace('\\\\\"', '')` will do it for you. This is certainly not a final solution as next week the page may contain other escaped characters, but this is a good starting point for you to understand what is happening and experiment with different solutions
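
To make the quoted advice concrete, here is a minimal sketch with a made-up payload (not the real page data). It shows how a backslash-backslash-quote sequence breaks `json.loads()` and how removing it lets the text parse.

```python
import json

# A made-up payload: the `\\` reads as an escaped backslash, so the quote
# that follows ends the JSON string early and the parser fails.
raw = r'["title", "He said \\"hi\\""]'

try:
    json.loads(raw)
    parsed_raw = True
except json.JSONDecodeError:
    parsed_raw = False

cleaned = raw.replace('\\\\\"', '')   # the three-character pattern: \\"
data = json.loads(cleaned)

print(parsed_raw)  # False
print(data)        # ['title', 'He said hi']
```

The trade-off is visible in the result: the inner quotation marks are dropped from the text rather than preserved.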

In [30]:
html = urlopen('https://campus.datacamp.com/courses/intro-to-python-for-data-science/chapter-1-python-basics?ex=2')
soup = BeautifulSoup(html, 'lxml')
string = soup.find('script', text=re.compile('window.PRELOADED_STATE')).string
json_text = string.strip('window.PRELOADED_STATE = "')[:-2]
json_text = BeautifulSoup(json_text, 'lxml').string.replace('\\\\\"', '')
lesson_json = json.loads(json_text)

So first I redefine `get_lesson_json()`.

In [11]:
def get_lesson_json(link):
    '''Receives lesson link from make_chapter_notes() and returns 
    the dictionary that holds all the information for the lesson's parent chapter.'''
    html = urlopen(link)
    soup = BeautifulSoup(html, 'lxml')
    string = soup.find('script', text=re.compile('window.PRELOADED_STATE')).string
    json_text = string.strip('window.PRELOADED_STATE = "')[:-2]
    json_text = BeautifulSoup(json_text, 'lxml').string.replace('\\\\\"', '')
    lesson_json = json.loads(json_text)
    return lesson_json
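
One caveat about the function above: `str.strip()` treats its argument as a set of characters to remove from both ends, not as a prefix. The version above works as long as the payload starts and ends with characters outside that set. A sketch of a stricter alternative, using a regular expression and a made-up script body (the helper name `extract_preloaded_state` is hypothetical):

```python
import re

def extract_preloaded_state(script_text):
    """Return the text between `window.PRELOADED_STATE = "` and the final `";`, or None."""
    match = re.search(r'window\.PRELOADED_STATE\s*=\s*"(.*)";', script_text, re.DOTALL)
    return match.group(1) if match else None

# A made-up script body; the real one is far longer.
sample = 'window.PRELOADED_STATE = "DATA";'

print(extract_preloaded_state(sample))              # DATA
print(sample.strip('window.PRELOADED_STATE = "'))   # ;
```

In the second line, every letter of `DATA` also appears in the strip argument, so `strip()` removes the payload along with the prefix.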

In [12]:
def make_chapter_notes(link):
    '''Debugging version: receives a lesson link and prints the type of each
    exercise in the lesson's parent chapter.
    (Note that a link from any lesson in a chapter will work. 
    That is, any lesson link has all the information for the chapter.)
    Feeds to: get_lesson_json()'''
    lesson_json = get_lesson_json(link)
    for item in lesson_json['exercises']['all']:
        if item['type'] == 'VideoExercise':
            print('Video')
        elif item['type'] == 'NormalExercise':
            print('Normal')
        elif item['type'] == 'BulletExercise':
            print('Bullet')
        elif item['type'] == 'MultipleChoiceExercise':
            print('multichoice')

In [16]:
html = urlopen('https://www.datacamp.com/courses/data-visualization-with-ggplot2-1')
soup = BeautifulSoup(html, 'lxml')

#string = soup.find('script', text=re.compile('window.PRELOADED_STATE')).string
#json_text = string.strip('window.PRELOADED_STATE = "')[:-2]
#json_text = BeautifulSoup(json_text, 'lxml').string.replace('\\\\\"', '')
#lesson_json = json.loads(json_text)

In [78]:
link = 'https://campus.datacamp.com/courses/data-visualization-with-ggplot2-1/chapter-2-data?ex=2'

In [79]:
json = get_lesson_json(link)

AttributeError: 'list' object has no attribute 'loads'
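
This is the error Python raises when `json.loads()` is called after the name `json` has been rebound to a list, which an assignment like `json = get_lesson_json(link)` does whenever the function returns a list. A minimal sketch of the effect, with made-up values:

```python
import json

parsed = json.loads('[1, 2]')   # works: `json` is the module

json = parsed                   # rebinding the name hides the module
try:
    json.loads('[3]')
except AttributeError as err:
    message = str(err)

import json                     # importing again restores the module
restored = json.loads('[3]')

print(message)   # 'list' object has no attribute 'loads'
print(restored)  # [3]
```

Re-running the import cell, or picking a different variable name for the parsed data, avoids the clash.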

In [93]:
html = urlopen(link)
soup = BeautifulSoup(html, 'lxml')
string = soup.find('script', text=re.compile('window.PRELOADED_STATE')).string
json_text = string.strip('window.PRELOADED_STATE = "')[:-2]
json_text = BeautifulSoup(json_text, 'lxml').string.replace('\\\\\"', '')
lesson_json = json.loads(json_text)

Joined, the character-by-character output begins:

["~#iM",["preFetchedData",["^0",["course",["^0",["status","SUCCESS","data",["^ ","id",774,"title","Data Visualization with ggplot2 (Part 1)","description","The ability

In [80]:
test = json[1][1][1][5][1][3]

for x in test:
    print(x[4])

VideoExercise
PureMultipleChoiceExercise
NormalExercise
NormalExercise
VideoExercise
NormalExercise
MultipleChoiceExercise
VideoExercise
NormalExercise
NormalExercise
NormalExercise
NormalExercise


In [74]:
my_list = test[2]

for i, x in enumerate(my_list):
    if len(str(x)) > 10:
        print(i, x, '\n')

4 NormalExercise 

6 <p>To get a first feel for <code>ggplot2</code>, let's try to run some basic <code>ggplot2</code> commands. Together, they build a plot of the <code>mtcars</code> dataset that contains information about 32 cars from a 1973 Motor Trend magazine. This dataset is small, intuitive, and contains a variety of continuous and categorical variables.</p> 

8 Exploring ggplot2, part 1 

10 # Load the ggplot2 package\n\n\n# Explore the mtcars data frame with str()\n\n\n# Execute the following command\nggplot(mtcars, aes(x = cyl, y = mpg)) +\n  geom_point() 

12 <ul>\n<li>Load the <code>ggplot2</code> package using <a href=http://www.rdocumentation.org/packages/base/functions/library><code>library()</code></a>. It is already installed on DataCamp's servers.</li>\n<li>Use <a href=http://www.rdocumentation.org/packages/utils/functions/str><code>str()</code></a> to explore the structure of the <code>mtcars</code> dataset.</li>\n<li>Hit <em>Submit Answer</em>. This will execute the

In [77]:
my_list = test[6]

for i, x in enumerate(my_list):
    if len(str(x)) > 10:
        print(i, x, '\n')

4 MultipleChoiceExercise 

6 <p>In the previous exercise you saw that <code>disp</code> can be mapped onto a color gradient or onto a continuous size scale.</p>\n<p>Another argument of <a href=http://www.rdocumentation.org/packages/ggplot2/functions/aes><code>aes()</code></a> is the <code>shape</code> of the points. There are a finite number of shapes which <a href=http://www.rdocumentation.org/packages/ggplot2/functions/ggplot><code>ggplot()</code></a> can automatically assign to the points. However, if you try this command in the console to the right:</p>\n<pre><code>ggplot(mtcars, aes(x = wt, y = mpg, shape = disp)) +\n  geom_point()\n</code></pre>\n<p>It gives an error. What does this mean?</p> 

8 Understanding Variables 

12 ['<code>shape</code> is not a defined argument.', '<code>shape</code> only makes sense with categorical data, and <code>disp</code> is continuous.', '<code>shape</code> only makes sense with continuous data, and <code>disp</code> is categorical.', '<code>shap

So `json[1][1][1][1]` has the chapter information.

And `json[1][1][1][5][1][3]` has all the lesson information.
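
The dump suggests that each record is a flat list: a leading marker such as `"^ "` followed by alternating keys and values, which would explain why the useful values sit at even positions (4, 6, 8 and so on). Assuming that holds, a small helper could turn a record into a dictionary. The helper name `flat_pairs_to_dict` is hypothetical, and whether every record carries readable key names is untested here.

```python
def flat_pairs_to_dict(flat, skip=1):
    """Turn [marker, key1, value1, key2, value2, ...] into {key1: value1, ...}."""
    items = flat[skip:]
    return dict(zip(items[0::2], items[1::2]))

# Sample data in the shape of the dump: a leading marker, then key/value pairs.
course = ['^ ', 'id', 774, 'title', 'Data Visualization with ggplot2 (Part 1)']

info = flat_pairs_to_dict(course)
print(info['title'])  # Data Visualization with ggplot2 (Part 1)
```

If the assumption holds for exercises too, looking values up by key would be less brittle than hard-coded positions.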

# Questions I need to answer:

1. Are the exercises for a chapter always in `json[1][1][1][5][1][3]`?

> **YES**

In [98]:
chap1 = 'https://campus.datacamp.com/courses/data-visualization-with-ggplot2-1/chapter-1-introduction-d19c22c0-9d9c-4202-b2fb-8630796b7dde?ex=2'
chap2 = 'https://campus.datacamp.com/courses/data-visualization-with-ggplot2-1/chapter-2-data?ex=2'
chap3 = 'https://campus.datacamp.com/courses/data-visualization-with-ggplot2-1/chapter-3-aesthetics?ex=2'
chap4 = 'https://campus.datacamp.com/courses/data-visualization-with-ggplot2-1/chapter-4-geometries?ex=2'

chaplist = [chap1, chap2, chap3, chap4]

In [104]:
for c in chaplist:
    chapter = get_lesson_json(c)
    lessons = chapter[1][1][1][5][1][3]
    for l in lessons:
        print(l[4])

VideoExercise
PureMultipleChoiceExercise
NormalExercise
NormalExercise
VideoExercise
NormalExercise
MultipleChoiceExercise
VideoExercise
NormalExercise
NormalExercise
NormalExercise
NormalExercise
VideoExercise
NormalExercise
NormalExercise
NormalExercise
PureMultipleChoiceExercise
VideoExercise
MultipleChoiceExercise
VideoExercise
NormalExercise
NormalExercise
NormalExercise
NormalExercise
VideoExercise
NormalExercise
NormalExercise
NormalExercise
NormalExercise
NormalExercise
NormalExercise
MultipleChoiceExercise
VideoExercise
NormalExercise
NormalExercise
VideoExercise
NormalExercise
NormalExercise
VideoExercise
NormalExercise
NormalExercise
VideoExercise
NormalExercise
NormalExercise
NormalExercise
NormalExercise
NormalExercise
NormalExercise
NormalExercise
VideoExercise
NormalExercise
NormalExercise
NormalExercise
NormalExercise


2. Is the list structure the same for exercises of the same type?

> **YES**

How to deal with NormalExercise?

In [159]:
for c in chaplist:
    chapter = get_lesson_json(c)
    lessons = chapter[1][1][1][5][1][3]
    for l in lessons:
        if l[4] == 'NormalExercise':
            print('#', l[8], '\n')
            print(convert_2_md(l[6]).replace('\n\n\\\\n', ''))
            print('## Instructions\n')
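            # Note: l[6] is the assignment text again; judging by the dump above, the instructions are at l[12].
            print(convert_2_md(l[6]).replace('\n\n\\\\n', ''))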
            print('## Code\n')
            print('```\n' + l[10].replace('\\n', '\n') + '\n```\n')
            print('```\n' + l[20].replace('\\n', '\n') + '\n```\n')

# Exploring ggplot2, part 1 

To get a first feel for `ggplot2`, let's try to run some basic `ggplot2` commands. Together, they build a plot of the `mtcars` dataset that contains information about 32 cars from a 1973 Motor Trend magazine. This dataset is small, intuitive, and contains a variety of continuous and categorical variables.

## Instructions

To get a first feel for `ggplot2`, let's try to run some basic `ggplot2` commands. Together, they build a plot of the `mtcars` dataset that contains information about 32 cars from a 1973 Motor Trend magazine. This dataset is small, intuitive, and contains a variety of continuous and categorical variables.

## Code

```
# Load the ggplot2 package


# Explore the mtcars data frame with str()


# Execute the following command
ggplot(mtcars, aes(x = cyl, y = mpg)) +
  geom_point()
```

```
# Load the ggplot2 package
library(ggplot2)

# Explore the mtcars data frame with str()
str(mtcars)

# Execute the following command
ggplot(mtcars, aes(x 

# base package and ggplot2, part 1 - plot 

These courses are about understanding data visualization in the context of the grammar of graphics. To gain a better appreciation of `ggplot2` and to understand how it operates differently from base package, it's useful to make some comparisons.

In the video, you already saw one example of how to make a (poor) multivariate plot in base package. In this series of exercises you'll take a look at a better way using the equivalent version in ggplot2.

First, let's focus on base package. You want to make a plot of `mpg` (miles per gallon) against `wt` (weight in thousands of pounds) in the `mtcars` data frame, but this time you want the dots colored according to the number of cylinders, `cyl`. How would you do that in base package? You can use a little trick to color the dots by specifying a `factor` variable as a color. This works because factors are just a special class of the `integer` type.

## Instructions

These courses are about understand

In the last exercise you saw how `iris.tidy` was used to make a specific plot. It's important to know how to rearrange your data in this way so that your plotting functions become easier. In this exercise you'll use functions from the `tidyr` package to convert `iris` to `iris.tidy`.

The resulting `iris.tidy` data should look as follows:

          Species  Part Measure Value\n    1  setosa Sepal  Length   5.1\n    2  setosa Sepal  Length   4.9\n    3  setosa Sepal  Length   4.7\n    4  setosa Sepal  Length   4.6\n    5  setosa Sepal  Length   5.0\n    6  setosa Sepal  Length   5.4\n    ...\n

You can have a look at the `iris` dataset by typing `head(iris)` in the console.

*Note:* If you're not familiar with `%>%`, `gather()` and `separate()`, you may want to take the [*Cleaning Data in R*](https://www.datacamp.com/courses/cleaning-data-in-r) course. In a nutshell, a dataset is called tidy when every row is an observation and every column is a variable. The `gather()` function moves 

Now that you've got some practice with incrementally building up plots, you can try to do it from scratch! The `mtcars` dataset is pre-loaded in the workspace.

## Instructions

Now that you've got some practice with incrementally building up plots, you can try to do it from scratch! The `mtcars` dataset is pre-loaded in the workspace.

## Code

```
# Map cyl to size



# Map cyl to alpha



# Map cyl to shape 



# Map cyl to label


```

```
# Map cyl to size
ggplot(mtcars, aes(x = wt, y = mpg, size = cyl)) +
  geom_point()

# Map cyl to alpha
ggplot(mtcars, aes(x = wt, y = mpg, alpha = cyl)) +
  geom_point()

# Map cyl to shape 
ggplot(mtcars, aes(x = wt, y = mpg, shape = cyl)) +
  geom_point()

# Map cyl to labels
ggplot(mtcars, aes(x = wt, y = mpg, label = cyl)) +
  geom_text()
```

# All about attributes, part 1 

In the video you saw that you can use all the aesthetics as attributes. Let's see how this works with the aesthetics you used in the previous exercises: `x`, `y`, `colo

In the last chapter you saw that all the visible aesthetics can serve as attributes and aesthetics, but I very conveniently left out x and y. That's because although you can make univariate plots (such as histograms, which you'll get to in the next chapter), a y-axis will always be provided, even if you didn't ask for it.

In the `base` package you can make univariate plots with [`stripchart()`](http://www.rdocumentation.org/packages/graphics/functions/stripchart) (shown in the viewer) directly and it will take care of a fake y axis for us. Since this is univariate data, there is no real y axis.

You can get the same thing in `ggplot2`, but it's a bit more cumbersome. The only reason you'd really want to do this is if you were making many plots and you wanted them to be in the same style, or you wanted to take advantage of an aesthetic mapping (e.g. colour).

## Instructions

In the last chapter you saw that all the visible aesthetics can serve as attributes and aesthetics, but I very 

In the chapter on aesthetics you saw different ways in which you will have to compensate for overplotting. In the video you saw a dataset that suffered from overplotting because of the precision of the dataset.

Another example you saw is when you have integer data. This can be continuous data measured on an `integer` (i.e. 1 ,2, 3 ...), as opposed to `numeric` (i.e. 1.1, 1.4, 1.5, ...), scale, or two categorical (e.g. `factor`) variables, which are just type `integer` under-the-hood.

In such a case you'll have a small, defined number of intersections between the two variables.

You will be using the `Vocab` dataset. The `Vocab` dataset contains information about the years of education and integer score on a vocabulary test for over 21,000 individuals based on US General Social Surveys from 1972-2004.

## Code

```
# Examine the structure of Vocab


# Basic scatter plot of vocabulary (y) against education (x). Use geom_point()



# Use geom_jitter() instead of geom_point()



# Using 

Overlapping histograms pose similar problems to overlapping bar plots, but there is a unique solution here: a frequency polygon.

This is a geom specific to binned data that draws a line connecting the value of each bin. Like [`geom_histogram()`](http://www.rdocumentation.org/packages/ggplot2/functions/geom_histogram), it takes a `binwidth` argument and by default `stat = bin` and `position = identity`.

## Code

```
# A basic histogram, add coloring defined by cyl
ggplot(mtcars, aes(mpg)) +
  geom_histogram(binwidth = 1)

# Change position to identity



# Change geom to freqpoly (position is identity by default)


```

```
# A basic histogram, add coloring defined by cyl
ggplot(mtcars, aes(mpg, fill = cyl)) +
  geom_histogram(binwidth = 1)

# Change position to identity
ggplot(mtcars, aes(mpg, fill = cyl)) +
  geom_histogram(binwidth = 1, position = identity)

# Change geom to freqpoly (position is identity by default)
ggplot(mtcars, aes(mpg, color = cyl)) +
  geom_freqpoly(binwidth 

By themselves, time series often contain enough valuable information, but you always want to maximize the number of variables you can show in a plot. This allows you (and your viewers) to begin making comparisons between those variables that would otherwise be difficult or impossible.

Here, you'll add shaded regions to the background to indicate recession periods. How do unemployment rate and recession period interact with each other?

In addition to the `economics` dataset from before, you'll also use the `recess` dataset for the periods of recession. The `recess` data frame contains 2 variables: the `begin` period of the recession and the `end`. It's already available in your workspace.

## Instructions

By themselves, time series often contain enough valuable information, but you always want to maximize the number of variables you can show in a plot. This allows you (and your viewers) to begin making comparisons between those variables that would otherwise be difficult or impossibl

How do I rewrite the exercises as a result?
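
One possible direction, sketched under assumptions: replace the dictionary keys in `NormalExercise_print()` with the list positions seen in the dumps above. The positions are observations from one course, not a documented format, and the function name and the `to_md` parameter are hypothetical. By default `to_md` leaves the HTML untouched so the sketch runs without pandoc; passing `convert_2_md` would convert it.

```python
# Positions taken from the dumps above; they are observations, not a documented format.
TITLE, ASSIGNMENT, SAMPLE_CODE, INSTRUCTIONS, SOLUTION = 8, 6, 10, 12, 20
FENCE = '`' * 3

def normal_exercise_to_markdown(item, to_md=lambda html: html):
    """Build the Markdown text for one NormalExercise list."""
    def code_block(text):
        # The page data stores newlines as the two characters backslash + n.
        return FENCE + '\n' + text.replace('\\n', '\n') + '\n' + FENCE + '\n'
    parts = [
        '# ' + item[TITLE] + '\n',
        '## Exercise\n',
        to_md(item[ASSIGNMENT]),
        '## Instructions\n',
        to_md(item[INSTRUCTIONS]),
        '## Code\n',
        code_block(item[SAMPLE_CODE]),
        code_block(item[SOLUTION]),
    ]
    return '\n'.join(parts)

# Made-up exercise list with values only at the positions used above.
fake = [None] * 21
fake[TITLE] = 'Exploring ggplot2, part 1'
fake[ASSIGNMENT] = '<p>Try some basic commands.</p>'
fake[INSTRUCTIONS] = '<ul><li>Load the package.</li></ul>'
fake[SAMPLE_CODE] = '# Load the ggplot2 package\\n'
fake[SOLUTION] = '# Load the ggplot2 package\\nlibrary(ggplot2)'

markdown = normal_exercise_to_markdown(fake)
print(markdown)
```

Returning a string instead of printing straight to a file keeps the formatting testable; the caller can still write the result to the chapter's text file.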
