# Assignment 07

## Due: See Date in Moodle

In this assignment you will use intermediate and advanced features in R.

To receive **full credit** for this assignment, you must complete **all** questions.

## This Week's Assignment

In this week's assignment, you will perform data wrangling. This includes, but is not limited to, the following:
    
- converting strings to numbers

- adding a new column to a dataframe

### Notes

- Adhere to good programming practices, utilizing descriptive variable names, appropriate spacing for readability, and adding comments to your code. 

- Ensure written responses maintain correct spelling, complete sentences, and proper grammar.

**Name:**

**Section:**

**Date:**

In this notebook we will be working with a slightly modified version of the skyscraper data that your class collected. The dataset can be accessed [**here**](https://docs.google.com/spreadsheets/d/1W0uRGIU43sMvQ1pANUtlSkFKeY_TMl9d3QZyegEhxOo/edit#gid=1105865786).


Let's get started!

In [None]:
ss <- read.csv('data/skyscrapers_world.csv')
head(ss)

## Data Wrangling and Cleaning in R

### `dplyr`

`dplyr` is an R package from the `tidyverse` that is designed for data manipulation and transformation. Key functions in `dplyr` for performing data moves include `filter()`, `select()`, `mutate()`, `arrange()`, and the combination of `group_by()` with `summarize()`.
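
As a rough illustration of these verbs, the sketch below applies each one to a small made-up data frame. The `towers` data frame and its column names are hypothetical and are not part of the assignment data.

```r
library(dplyr)

# Hypothetical toy data; not the assignment dataset.
towers <- data.frame(
  name   = c("A", "B", "C", "D"),
  city   = c("X", "X", "Y", "Y"),
  height = c(300, 450, 380, 520)
)

# filter(): keep rows that satisfy a condition.
tall <- towers %>% filter(height > 350)

# select(): keep only the named columns.
names_only <- towers %>% select(name, height)

# mutate(): add a new column computed from existing ones.
with_feet <- towers %>% mutate(height_feet = height * 3.28084)

# arrange(): sort rows (here, tallest first).
sorted <- towers %>% arrange(desc(height))

# group_by() + summarize(): one summary row per group.
by_city <- towers %>%
  group_by(city) %>%
  summarize(mean_height = mean(height))

by_city
```

Each verb takes a data frame and returns a new one, which is why the calls chain together with the pipe (`%>%`).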

**Question 1.** Load the `dplyr` package.

In [None]:
library(dplyr)

Let's replicate the data cleaning and wrangling we performed in Python using R instead.

**Example 1.** Drop the `X` column.

In [None]:
ss <- ss %>% 
        select(-'X')

colnames(ss)

**Example 2.** Rename the `status.completed`, `status.started`, and `height.in.meters` columns to `snake_case` format.

In [None]:
ss <- ss %>% 
        rename(year_completed = 'status.completed',
               year_started = 'status.started',
               height_meters = 'height.in.meters'
)

colnames(ss)

**Example 3.** Standardize the `height_meters` column by keeping only values in meters and converting the data type from character to numeric.


**Notes:**

```
gsub(pattern, replacement, x)
```
- The `gsub()` function in R is used for pattern replacement within character strings. It searches for a specified pattern or regular expression in a string and replaces it with a given value.

- Here, it searches for commas (`","`) in the `height_meters` column and replaces them with an empty string (`""`).

```
as.numeric(...)
```

- The `as.numeric()` function converts a character string into a numeric data type.

- After `gsub()` removes the commas, the cleaned value (which is still a string) is converted into a numeric format.
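
A minimal sketch of the two steps on a made-up character vector (the values in `raw_heights` are hypothetical, not taken from the dataset):

```r
# Hypothetical height strings, some with thousands separators.
raw_heights <- c("1,250", "828", "2,717")

# Step 1: remove every comma; the result is still character.
no_commas <- gsub(",", "", raw_heights)
class(no_commas)   # "character"

# Step 2: convert the cleaned strings to numbers.
heights <- as.numeric(no_commas)
heights            # 1250  828 2717

# Strings that are not valid numbers become NA (with a warning).
not_a_number <- suppressWarnings(as.numeric("about 300"))
not_a_number       # NA
```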

In [None]:
ss <- ss %>%
  mutate(clean_height_meters = as.numeric(gsub(",", "", height_meters)))

**Example 4.** Confirm that the code functions as expected.

In [None]:
...

**Example 5.** Identify non-numeric values in the `floors` column and convert them to numeric values.

In [None]:
...

In [None]:
ss$floors[49] <- 103
ss$floors[62] <- 73
ss$floors

In [None]:
ss$floors <- as.numeric(ss$floors)
ss$floors

**Question 2.** Filter the `data.frame` to include only rows where the `country` column matches one of the values in `uae`. Then, extract the row indices of these filtered rows.

In [None]:
uae <- c("United Arab Emirates", "United Arab Emirates (UAE)", "Dubai", "UAE")
...

**Note:** The `which()` function in R returns the indices of TRUE values in a logical vector. It is commonly used to find the positions of elements that satisfy a certain condition.
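
A small sketch of `which()` on a made-up numeric vector (the name `floor_counts` and its values are hypothetical):

```r
# Hypothetical floor counts.
floor_counts <- c(45, 102, 88, 163, 60)

# The comparison produces a logical vector...
is_tall <- floor_counts > 100
is_tall          # FALSE  TRUE FALSE  TRUE FALSE

# ...and which() reports the positions of the TRUE values.
tall_positions <- which(is_tall)
tall_positions   # 2 4
```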

In [None]:
for (i in rows) {
  previous_label <- ss[i, "country"]
  ss[i, "country"] <- "UAE"
  cat(sprintf("Row index: %-5d Previous label: %-30s New label: %-20s\n",
              i, previous_label, ss[i, "country"]))
}

**Note:** The `sprintf()` function in R is used for formatted string output, similar to Python’s `f-string` or `format()` method. It allows you to embed values into strings with specific formatting, such as fixed decimal places, padding, and alignment.
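
To make the format codes concrete, here is a short sketch using made-up values (`label`, `height`, and the floor count are hypothetical):

```r
# Hypothetical values used only to illustrate the format codes.
label <- "Tower A"
height <- 828

# %-10s : string, left-aligned, padded to 10 characters
# %8.1f : number with 1 decimal place, right-aligned in 8 characters
# %d    : whole number
line <- sprintf("%-10s|%8.1f|%d floors", label, height, 163L)
line   # "Tower A   |   828.0|163 floors"
```

`sprintf()` only builds the string; wrapping it in `cat()`, as in the loop above, is what prints it without quotes.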

**Example 6.** Change the mislabeled value for China.

In [None]:
...

In [None]:
...

## Visualization in R

`ggplot2` is a data visualization library in R. It is part of the `tidyverse` and follows the **grammar of graphics**, which allows users to build plots layer by layer.

`ggplot2` uses the following structure:

```
ggplot(data, aes(x = ..., y = ...)) + 
  geom_xxx() + 
  additional_layers
```

- `ggplot(data, aes(x, y))`: Calls the `ggplot()` function to initialize the plot with data and aesthetics (`aes()`)

- `geom_...()`: Specifies the type of plot (e.g., `geom_point()` for scatter plots)

- `+ additional_layers`: Can include labels, themes, colors, facets, etc.
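
As a minimal sketch of this structure, the example below uses R's built-in `mtcars` dataset rather than the assignment data; the choice of columns and labels is purely illustrative.

```r
library(ggplot2)

# Initialize the plot with data and aesthetic mappings.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  # Choose the geom: points for a scatter plot.
  geom_point() +
  # Additional layer: a title and axis labels.
  labs(title = "Fuel Economy vs. Weight",
       x = "Weight (1000 lbs)",
       y = "Miles per gallon")

p
```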

Load the `ggplot2` library.

In [None]:
...

In [None]:
# Set default plot size for all plots.
# This ensures all plots have consistent sizing.
options(repr.plot.width = 12, repr.plot.height = 8) 

### Bar Chart

Let's look at a summary of the material column labels.

**Example 7.** Create a table for the frequency counts in the `material` column.

In [None]:
tbl <- table(ss$material)
tbl

To create a plot in `ggplot2`, we first initialize a `ggplot` object with data and aesthetic mappings. Then, we add layers such as geoms (`geom_bar()`), labels, and themes to build the complete visualization.

**Question 4.** Create a bar chart to visualize the distribution of the materials used in the `ss` dataframe.

- First create the `ggplot` object.

In [None]:
g <- ...

In [None]:
g

Then add the layers to the `ggplot` object `g`.

In [None]:
g + geom_bar()

After that you can customize the layers (in this case `geom_bar()`) by adding `fill = ...` to color the bars.

**Note:** A table of R colors can be found [here](https://sites.stat.columbia.edu/tzheng/files/Rcolor.pdf).

In [None]:
g + geom_bar(fill = ...)

You can also add a labels layer and change the theme.

In [None]:
g + geom_bar(fill = ..., color = ...) +
  labs(title = 'Distribution of Materials', x = 'Category', y = 'Count') +
  theme_classic()

You can also change the category labels on the x-axis.

In [None]:
g + geom_bar(fill = ..., color = ...) +
  labs(title = 'Distribution of Materials', x = NULL, y = 'Count') +
  scale_x_discrete(labels = c('composite' = 'Composite', 
                              'concrete' = 'Concrete', 
                              'steel' = 'Steel', 
                              'steel/concrete' = 'Steel/Concrete')
                  ) +
  theme_classic()

The process for creating different types of plots in `ggplot2` follows a similar structure:  

1. Initialize the `ggplot` object: Define the dataset and aesthetics using `ggplot(data, aes(...))`.  

1. Add a geometric layer (`geom_...()`): Choose the appropriate geom for visualization (e.g., `geom_point()`, `geom_bar()`, `geom_line()`).  

1. Customize the geom layer: Modify colors, sizes, transparency, or other visual properties.  

1. Add labels and annotations: Use `labs()` to specify the title, axis labels, and captions.  

1. Add additional layers: Include facets, guides, or other modifications.  

1. Apply a theme: Choose a built-in theme (e.g., `theme_minimal()`, `theme_classic()`) or customize elements with `theme()`.  
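
The six steps above can be sketched in a single chain. This example uses the built-in `mtcars` dataset, not the assignment data, and the colors, labels, and `coord_flip()` step are illustrative choices only.

```r
library(ggplot2)

# 1. Initialize: dataset and aesthetics (cyl is numeric, so treat it as a category).
p <- ggplot(mtcars, aes(x = factor(cyl))) +
  # 2-3. Add a geom and customize its appearance.
  geom_bar(fill = "steelblue", color = "black") +
  # 4. Labels.
  labs(title = "Cars by Cylinder Count", x = "Cylinders", y = "Count") +
  # 5. An additional modification: flip to horizontal bars.
  coord_flip() +
  # 6. Theme.
  theme_minimal()

p
```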


### Histogram

**Question 5:** Create a histogram to visualize the distribution of `floors`.

In [None]:
g <- ...
g + geom_histogram()

In [None]:
g + geom_histogram(..., fill = 'lightblue', color = 'white') +
  labs(title = 'Distribution of Floors', x = NULL, y = 'Frequency') +
  theme_classic()

**Question 6.** Examine the histogram of floor counts in the dataset. What trends do you observe, and what insights can be drawn from the distribution?  

In particular, explore: 

- How the distribution of floors varies across the dataset.
  
- Potential factors that may influence floor counts (e.g., building type, region, or year built). 

- Whether any outliers or unusual patterns exist in the data.

_TYPE YOUR ANSWER HERE REPLACING THIS TEXT_

#### Facet

A facet in `ggplot2` is a way to split a plot into multiple subplots based on the values of one or more categorical variables. Each subplot (facet) shows a subset of the data, making it easier to compare distributions or trends across different groups.

**Example 8.** Create a faceted histogram to visualize the distribution of floors across different material types and compare floor counts by material category.

In [None]:
g + geom_histogram(bins = 15, fill = ..., color = ...) +
  ... +
  labs(title = 'Distribution of Floors', x = NULL, y = 'Frequency') +
  theme_classic()

- `facet_wrap(~ material)`: Creates separate plots (facets) for each unique value in the material column.

- Each facet shares the **same axes**, but plots data independently for each category.
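
For comparison, here is a small faceting sketch on the built-in `mtcars` dataset (not the assignment data); the bin count and colors are arbitrary.

```r
library(ggplot2)

# One histogram panel per cylinder count.
p <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10, fill = "lightblue", color = "white") +
  facet_wrap(~ cyl)

p
```

By default, all panels use the same axis ranges, which makes heights and positions directly comparable. `facet_wrap()` also accepts a `scales` argument (for example, `scales = "free_y"`) if each panel should get its own y-axis range.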

**Question 7:** Examine the faceted histogram of floor counts across material types. What trends do you observe, and what insights can be drawn from the distributions?  

In particular, explore:  

- How the distribution of floors varies across different material types.  

- Potential factors that may influence floor counts (e.g., building type, region, or year built).  

- Whether any outliers or unusual patterns exist within specific material categories.

_TYPE YOUR ANSWER HERE REPLACING THIS TEXT_

### Box Plots

A box plot can also be used to visualize the distribution of floors across different material types and compare floor counts by material category.
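
A minimal box plot sketch on the built-in `mtcars` dataset (not the assignment data), with a categorical variable on the x-axis and a numeric variable on the y-axis; the colors and labels are arbitrary.

```r
library(ggplot2)

# Categorical variable on x, numeric variable on y.
p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(fill = "lightgray", color = "black") +
  labs(x = "Cylinders", y = "Miles per gallon")

p
```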

**Example 9.** Create a box plot to compare the distribution of the numerical variable `floors` grouped by `material`.

In [None]:
g <- ...

In [None]:
g + geom_boxplot(fill = ..., color = ...)

### Scatter Plot

We can create a scatter plot to visualize the relationship between height and floors.

**Question 8:** Create a scatter plot to visualize the relationship between height and floors.

In [None]:
g <- ...

In [None]:
g + geom_point(size = 2)

**Question 9:** Examine the scatter plot of height and floors. What trends do you observe, and what insights can be drawn from the relationship between these variables?  

In particular, explore:  

- How the relationship between height and the number of floors varies.  

- Potential factors that may influence this relationship (e.g., building type, region, or construction year).  

- Whether any outliers or unusual patterns exist.

_TYPE YOUR ANSWER HERE REPLACING THIS TEXT_

**Example 10.** Color the points in the scatter plot based on the material type.

In [None]:
g <- ggplot(data = ss, aes(x = clean_height_meters, y = floors, ...))

In [None]:
g + geom_point()

What do you notice? What do you wonder?

### Line Charts

When visualizing time series data, a line chart is typically more effective than a scatter plot because it helps show trends, patterns, and continuity over time.  

- Generally speaking, a line chart is better for time series data because it:

  - shows trends and patterns over time (e.g., increasing, decreasing, seasonal variations).  

  - emphasizes the connection between data points (since time is continuous).  

  - helps identify cycles, peaks, and outliers more clearly.
  

- Sometimes a scatter plot might be used instead:

    - If the data points are sparse and do not follow a clear trend.  

    - When exploring individual time-stamped events rather than continuous trends.  
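
To see the difference, the sketch below draws the same made-up yearly series both ways. The `yearly` data frame and its values are hypothetical.

```r
library(ggplot2)

# Hypothetical yearly measurements.
yearly <- data.frame(
  year  = 2015:2022,
  value = c(10, 12, 15, 14, 18, 21, 20, 25)
)

# Shared starting point for both plots.
base <- ggplot(yearly, aes(x = year, y = value))

# Line chart: consecutive years are connected, so the trend is easy to follow.
line_plot <- base + geom_line()

# Scatter plot: the same values shown as unconnected points.
point_plot <- base + geom_point()

line_plot
point_plot
```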

**Example 11.** Make a line chart for the skyscraper dataset.

In [None]:
g <- ggplot(data = ss, aes(x = year_completed, y = clean_height_meters))
g + geom_line()

What do you notice? What do you wonder?

#### Group and Aggregate

The code ran without errors, but the visualization lacks insight. Why?

Grouping buildings by completion year and aggregating heights (e.g., average or maximum) would better reveal trends over time.

**Example 12.** Group buildings by completion year and use the median height to capture yearly trends. Save the grouped dataframe to an object named `dat` with columns named `year_completed` and `median_height`.

In [None]:
dat <- ss %>%
         ... %>%
         ...(median_height = ...)
dat

Since the first and last two observations were entered incorrectly, we can slice them out. The `slice()` function from `dplyr` selects specific rows from a data frame based on their index values.
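
A small sketch of `slice()` on a made-up five-row data frame (`yearly` and its values are hypothetical):

```r
library(dplyr)

# Hypothetical five-row data frame.
yearly <- data.frame(year = 2001:2005, value = c(5, 7, 6, 9, 8))

# Keep rows 2 through 4 (drops the first and last rows).
middle <- yearly %>% slice(2:4)
middle

# Negative indices drop rows instead: remove the first row only.
without_first <- yearly %>% slice(-1)
without_first
```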

In [None]:
dat %>% slice(2:31)

In [None]:
dat <- dat %>% slice(2:31)

In [None]:
str(ss)

**Example 13.** Change the `year_completed` column from character (`chr`) to numeric (`num`).

In [None]:
dat$year_completed <- as.numeric(dat$year_completed)

Now we can create the line plot.

In [None]:
g <- ggplot(data = dat, aes(x = year_completed, y = median_height))

In [None]:
g + geom_line()

In [None]:
dat <- dat %>% slice(5:31)

In [None]:
g <- ggplot(data = dat, aes(x = year_completed, y = median_height))

In [None]:
g + geom_line()

What if we did a scatter plot?

In [None]:
g <- ggplot(data = ss, aes(x = as.numeric(year_completed), y = clean_height_meters))

In [None]:
g + geom_point(size = 2)

**Question 10:** Which plot, the line plot or the scatter plot, is most useful for this data? Explain your choice by describing how it best represents trends or patterns and why it is the most effective visualization.

_TYPE YOUR ANSWER HERE REPLACING THIS TEXT_

## Submission

Make sure that all cells in your assignment have been executed to display all output, images, and graphs in the final document.

**Note:** Save the assignment before proceeding to download the file.

After downloading, locate the `.ipynb` file and upload **only** this file to Moodle. The assignment will be automatically submitted to Gradescope for grading.