## Introduction
<p><img src="https://assets.datacamp.com/production/project_1174/img/trendlines.jpg" alt="Image of two trendlines over time."></p>
<p>It’s important to stay informed about trends in programming languages and technologies. Knowing which languages are growing or shrinking can help you decide where to invest your time.</p>
<p>An excellent source for understanding which technologies are popular is <a href="https://stackoverflow.com/">Stack Overflow</a>. Stack Overflow is an online question-and-answer site for coding topics. By looking at the number of questions about each technology, you can get an idea of how many people are using it.</p>
<p>You'll be working with a dataset with one observation for each tag in each year. The dataset was downloaded from the <a href="https://data.stackexchange.com/">Stack Exchange Data Explorer</a>. Below you can find an overview of the data that is available to you:<br><br></p>
<div style="background-color: #efebe4; color: #05192d; text-align:left; vertical-align: middle; padding: 15px 25px 15px 25px; line-height: 1.6;">
    <div style="font-size:20px"><b>datasets/stack_overflow_data.csv</b></div>
<ul>
    <li><b>year:</b> The year the question was asked.</li>
    <li><b>tag:</b> A word or phrase that describes the topic of the question.</li>
    <li><b>number:</b> The number of questions with a certain tag in that year.</li>
    <li><b>year_total:</b> The total number of questions asked in that year.</li>
</ul>
    </div>
<p>From here on out, it will be your task to explore and manipulate the existing data until you are able to answer the questions described in the instructions panel. Feel free to add as many cells as necessary. Finally, remember that you are only tested on your answer, not on the methods you use to arrive at the answer!</p>
<p><em><strong>Note:</strong> If you haven't completed a DataCamp project before you should check out the <a href="https://projects.datacamp.com/projects/41">Intro to Projects</a> first to learn about the interface. In this project, you also need to know your way around data manipulation and visualization in the Tidyverse and it's recommended that you take a look at the course <a href="https://www.datacamp.com/courses/introduction-to-the-tidyverse">Introduction to the Tidyverse</a>.</em></p>

In [2]:
# The two questions to answer are the following:
# 1. What fraction of the total number of questions asked in 2019 had the R tag?
# 2. What were the five most asked-about tags from 2015 to 2020 (inclusive)?

In [3]:
# Load the readr, dplyr and ggplot2 packages
library(readr)
library(dplyr)
library(ggplot2)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union



In [4]:
# Read the stack_overflow_data.csv dataset using read_csv() and store the result to the stack_overflow_data dataframe
stack_overflow_data <- read_csv("datasets/stack_overflow_data.csv")

# Print the stack_overflow_data dataframe
stack_overflow_data

Parsed with column specification:
cols(
  year = col_double(),
  tag = col_character(),
  number = col_double(),
  year_total = col_double()
)


year,tag,number,year_total
<dbl>,<chr>,<dbl>,<dbl>
2008,treeview,69,168541
2008,scheduled-tasks,30,168541
2008,specifications,21,168541
2008,rendering,35,168541
2008,http-post,6,168541
2008,static-assert,1,168541
2008,asp.net-ajax,159,168541
2008,collision-detection,10,168541
2008,systray,4,168541
2008,html-helper,20,168541
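
The column specification that `read_csv()` reports above can also be stated explicitly. The following is a minimal sketch, assuming the four columns have the types shown in that message. The temporary file and the names `sample_path` and `sample_data` are hypothetical stand-ins for the real CSV, filled with three rows taken from the printed output.

```r
library(readr)

# Hypothetical stand-in for datasets/stack_overflow_data.csv, using three rows from the printed output
sample_path <- tempfile(fileext = ".csv")
writeLines(
    c("year,tag,number,year_total",
      "2008,treeview,69,168541",
      "2008,scheduled-tasks,30,168541",
      "2008,specifications,21,168541"),
    sample_path
)

# Declare the column types up front so that read_csv() does not have to guess them
sample_data <- read_csv(
    sample_path,
    col_types = cols(
        year = col_double(),
        tag = col_character(),
        number = col_double(),
        year_total = col_double()
    )
)

sample_data
```

Passing `col_types` makes the expected schema part of the code, so a malformed column surfaces as a parsing problem rather than as a silently different type.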


In [5]:
# Filter the stack_overflow_data, using the filter() verb from dplyr, for those entries whose year is 2019 and whose tag is "r", and store the result to the stack_overflow_data_r_2019 dataframe
stack_overflow_data_r_2019 <- stack_overflow_data %>%
    filter((year == 2019) & (tag == "r"))

# Print the stack_overflow_data_r_2019 dataframe
stack_overflow_data_r_2019

year,tag,number,year_total
<dbl>,<chr>,<dbl>,<dbl>
2019,r,52249,5410632


In [6]:
# Add a percentage column to the stack_overflow_data_r_2019 dataframe, using the mutate() verb from dplyr, and overwrite the stack_overflow_data_r_2019 dataframe with the result
stack_overflow_data_r_2019 <- stack_overflow_data_r_2019 %>%
    mutate(percentage = (number / year_total) * 100)

# Print the overwritten stack_overflow_data_r_2019 dataframe
stack_overflow_data_r_2019

year,tag,number,year_total,percentage
<dbl>,<chr>,<dbl>,<dbl>,<dbl>
2019,r,52249,5410632,0.9656728


In [7]:
# Select the fifth column (percentage) of the stack_overflow_data_r_2019 dataframe and store it to the r_percentage variable
r_percentage <- stack_overflow_data_r_2019[5]

# Print the r_percentage variable
r_percentage

# The answer to Question 1. is given by: r_percentage = 0.9656728 (a percentage, i.e. a fraction of about 0.0097)

percentage
<dbl>
0.9656728
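
Question 1 asks for a fraction, while the value above is a percentage held in a one-column dataframe. As a minimal sketch of one way to get a plain number instead, the snippet below rebuilds the single 2019 row for the R tag by hand from the printed output and uses `pull()` from dplyr. The names `r_2019` and `r_fraction` are hypothetical and not part of the original notebook.

```r
library(dplyr)

# The single row printed for the R tag in 2019, rebuilt by hand so the sketch is self-contained
r_2019 <- tibble(year = 2019, tag = "r", number = 52249, year_total = 5410632)

# pull() returns the column as a plain numeric vector instead of a one-column dataframe
r_fraction <- r_2019 %>%
    mutate(fraction = number / year_total) %>%
    pull(fraction)

r_fraction
```

Multiplying `r_fraction` by 100 gives back the percentage shown above.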


In [8]:
# Filter the stack_overflow_data, using the filter() verb from dplyr, for those entries whose year is between 2015 and 2020 (inclusive), and store the result to the stack_overflow_2015_2020_data dataframe
stack_overflow_2015_2020_data <- stack_overflow_data %>%
    filter((year >= 2015) & (year <= 2020))

# Print the stack_overflow_2015_2020_data dataframe
stack_overflow_2015_2020_data

year,tag,number,year_total
<dbl>,<chr>,<dbl>,<dbl>
2015,conda,151,6612772
2015,anonymous-types,86,6612772
2015,extended-ascii,27,6612772
2015,git-fsck,6,6612772
2015,textblob,40,6612772
2015,git-add,41,6612772
2015,interrupt-handling,86,6612772
2015,documentlistener,22,6612772
2015,turtle-rdf,16,6612772
2015,css-content,20,6612772


In [11]:
# Using the group_by() and summarize() verbs from dplyr, group the stack_overflow_2015_2020_data dataframe by tag, summarize it with a fraction column giving the sum of the number divided by the sum of the year total for each tag, arrange in descending order of fraction, extract the top 5 tags in that order, and store the result to the stack_overflow_2015_2020_highest_tags_data dataframe
stack_overflow_2015_2020_highest_tags_data <- stack_overflow_2015_2020_data %>%
    group_by(tag) %>%
    summarize(fraction = sum(number) / sum(year_total)) %>%
    arrange(desc(fraction)) %>%
    top_n(5)

# Print the stack_overflow_2015_2020_highest_tags_data dataframe
stack_overflow_2015_2020_highest_tags_data

Selecting by fraction


tag,fraction
<chr>,<dbl>
javascript,0.03812043
python,0.03296431
java,0.02727272
android,0.02046203
c#,0.02025986
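
The `Selecting by fraction` message appears because `top_n()` was not told which column to rank by. In dplyr 1.0.0 and later, `top_n()` is superseded by `slice_max()`, which takes the ranking column explicitly. The following is a minimal sketch, assuming a dplyr version that provides `slice_max()`. It rebuilds the five rows printed above by hand, in shuffled order, and keeps the top three. The names `tag_fractions` and `top_three` are hypothetical.

```r
library(dplyr)

# The five rows printed above, rebuilt by hand and deliberately shuffled
tag_fractions <- tibble(
    tag = c("android", "javascript", "c#", "java", "python"),
    fraction = c(0.02046203, 0.03812043, 0.02025986, 0.02727272, 0.03296431)
)

# slice_max() ranks by the named column and returns the rows in descending order
top_three <- tag_fractions %>%
    slice_max(fraction, n = 3, with_ties = FALSE)

top_three$tag
```

Setting `with_ties = FALSE` guarantees exactly `n` rows even if two tags share the same fraction.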


In [13]:
# Select the tag column from the stack_overflow_2015_2020_highest_tags_data data frame and store it to the highest_tags variable
highest_tags <- stack_overflow_2015_2020_highest_tags_data$tag

# Print the highest_tags variable
highest_tags

# The answer to Question 2. is given by highest_tags = c("javascript", "python", "java", "android", "c#")

In [None]:
# Alternatively, for Question 2., one could visualize the yearly fraction of questions for the five highest tags using dplyr and ggplot2 as follows

# Keep only the yearly rows of the five highest tags; calling top_n(5) on data grouped by tag would instead keep up to five rows for every tag
stack_overflow_2015_2020_highest_tags_data <- stack_overflow_2015_2020_data %>%
    filter(tag %in% highest_tags) %>%
    mutate(fraction = number / year_total)

ggplot(data = stack_overflow_2015_2020_highest_tags_data, aes(x = year, y = fraction, color = tag)) +
    geom_line()
