Copyright 2022 Andrew M. Olney and made available under [CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0) for text and [Apache-2.0](http://www.apache.org/licenses/LICENSE-2.0) for code.

# Data Science and the Nature of Data: Problem Solving

This notebook is a companion to [Data Science and the Nature of Data](Data-science-and-the-nature-of-data.ipynb), so please read that first.

## Problem

We will be working with [a set of flower images](https://drive.google.com/file/d/16OFwIazU-dnu27kzP08iuZvk_lnBK9Ak/view?usp=sharing).
As discussed in the last notebook, this is *unstructured data*.

A previous group has manually coded these data with three variables:

- PetalColor: unicolor or multicolor
- PetalShape: rounded or unrounded
- Size: small, medium, or large

In this session, you will load the data into a dataframe in Jupyter and manipulate rows and columns of the data.

## Load the data into a dataframe 

Use [Data Science and the Nature of Data](Data-science-and-the-nature-of-data.ipynb) if you've forgotten any of these steps.

Import the `readr` library, which lets us work with dataframes:

- `library readr`

In [1]:
library(readr)

Load a dataframe with the data in "datasets/flowers.csv" and display it:

- Set `dataframe` to `with readr do read_csv using "datasets/flowers.csv"`
- `dataframe` (to display)

In [2]:
dataframe = readr::read_csv("datasets/flowers.csv")

dataframe

[1mRows: [22m[34m210[39m [1mColumns: [22m[34m4[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m (4): File, PetalColor, PetalShape, Size

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.


File,PetalColor,PetalShape,Size
<chr>,<chr>,<chr>,<chr>
0001.png,multicolor,rounded,medium
0002.png,unicolor,rounded,medium
0003.png,unicolor,unrounded,large
0004.png,multicolor,rounded,medium
0005.png,multicolor,rounded,small
0006.png,unicolor,rounded,small
0007.png,unicolor,rounded,medium
0008.png,unicolor,rounded,medium
0009.png,unicolor,rounded,small
0010.png,unicolor,rounded,small


**QUESTION:**

What are the variable types for each variable?

**ANSWER: (click here to edit)**

- PetalColor: nominal
- PetalShape: nominal
- Size: ordinal


Load the `dplyr` library for row and column selection:

- `library dplyr`

In [4]:
library(dplyr)

“package ‘dplyr’ was built under R version 4.1.3”

Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




Select the first 10 rows of the data:

- with `dplyr` do `slice` using `dataframe` and `1:10` 

In [5]:
dplyr::slice(dataframe,1:10)

File,PetalColor,PetalShape,Size
<chr>,<chr>,<chr>,<chr>
0001.png,multicolor,rounded,medium
0002.png,unicolor,rounded,medium
0003.png,unicolor,unrounded,large
0004.png,multicolor,rounded,medium
0005.png,multicolor,rounded,small
0006.png,unicolor,rounded,small
0007.png,unicolor,rounded,medium
0008.png,unicolor,rounded,medium
0009.png,unicolor,rounded,small
0010.png,unicolor,rounded,small


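Select the last 10 rows of the data.
You can work out the row numbers from the row count reported when the data was loaded (210 rows).
Note that R ranges include both endpoints, so `200:210` below actually spans 11 rows; `201:210` would select exactly the last 10: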

- with `dplyr` do `slice` using `dataframe` and `200:210` 

In [6]:
dplyr::slice(dataframe,200:210)

File,PetalColor,PetalShape,Size
<chr>,<chr>,<chr>,<chr>
0200.png,unicolor,unrounded,small
0201.png,unicolor,rounded,medium
0202.png,multicolor,unrounded,large
0203.png,multicolor,unrounded,large
0204.png,unicolor,unrounded,small
0205.png,unicolor,unrounded,medium
0206.png,multicolor,rounded,large
0207.png,unicolor,rounded,large
0208.png,unicolor,unrounded,large
0209.png,multicolor,rounded,medium
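The row range for the last 10 rows can also be computed instead of typed by hand. The following is a minimal sketch, assuming a made-up stand-in dataframe named `toy` with 210 rows (it is not the flowers data); `dplyr::slice_tail` is a `dplyr` helper (available from version 1.0.0) that returns the final `n` rows directly.

```r
# Hypothetical stand-in for the flowers dataframe: 210 rows, one column
toy <- data.frame(File = sprintf("%04d.png", 1:210))

# Compute the range from the row count: (n - 9):n spans exactly 10 rows
n <- nrow(toy)
last_ten <- dplyr::slice(toy, (n - 9):n)

# Same selection without any arithmetic
last_ten_tail <- dplyr::slice_tail(toy, n = 10)

nrow(last_ten)
```

Computing the range from `nrow()` avoids off-by-one mistakes when the number of rows changes.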


Select 10 rows from near the middle of the data:

- with `dplyr` do `slice` using `dataframe` and `96:105` 

In [7]:
dplyr::slice(dataframe,96:105)

File,PetalColor,PetalShape,Size
<chr>,<chr>,<chr>,<chr>
0096.png,unicolor,rounded,medium
0097.png,unicolor,unrounded,medium
0098.png,multicolor,rounded,medium
0099.png,unicolor,rounded,medium
0100.png,unicolor,rounded,medium
0101.png,multicolor,unrounded,medium
0102.png,unicolor,unrounded,medium
0103.png,unicolor,unrounded,small
0104.png,unicolor,rounded,small
0105.png,multicolor,unrounded,large
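For a dataframe with 210 rows, the exactly centered block of 10 rows can be worked out with a little arithmetic. This is a small base-R sketch; the variable names `n`, `start`, and `middle_rows` are illustrative and not part of the notebook.

```r
# Row count reported when the data was loaded
n <- 210

# Leave the same number of rows before and after the 10-row block
start <- (n - 10) %/% 2 + 1
middle_rows <- start:(start + 9)

middle_rows
```

The resulting vector could be passed to `dplyr::slice` in place of a hand-typed range.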


Select the first two coded variables (the two columns after `File`):

- with `dplyr` do `select` using `dataframe` and `PetalColor:PetalShape` 

In [8]:
dplyr::select(dataframe,PetalColor:PetalShape)

PetalColor,PetalShape
<chr>,<chr>
multicolor,rounded
unicolor,rounded
unicolor,unrounded
multicolor,rounded
multicolor,rounded
unicolor,rounded
unicolor,rounded
unicolor,rounded
unicolor,rounded
unicolor,rounded


Select the last two columns of the data:

- with `dplyr` do `select` using `dataframe` and `PetalShape:Size` 

In [9]:
dplyr::select(dataframe,PetalShape:Size)

PetalShape,Size
<chr>,<chr>
rounded,medium
rounded,medium
unrounded,large
rounded,medium
rounded,small
rounded,small
rounded,medium
rounded,medium
rounded,small
rounded,small


Select the middle one of the three coded variables:

- with `dplyr` do `select` using `dataframe` and `PetalShape` 

In [11]:
dplyr::select(dataframe,PetalShape)

PetalShape
<chr>
rounded
rounded
unrounded
rounded
rounded
rounded
rounded
rounded
rounded
rounded
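Column selections like the ones above can be written in more than one way with `dplyr::select`. The sketch below uses a tiny made-up dataframe named `toy` that only mimics the column layout of the flowers data; it shows selection by name range, by position, and by dropping a column.

```r
# Hypothetical two-row stand-in with the same column names as the flowers data
toy <- data.frame(
  File       = c("0001.png", "0002.png"),
  PetalColor = c("multicolor", "unicolor"),
  PetalShape = c("rounded", "rounded"),
  Size       = c("medium", "medium")
)

# A name range selects every column from PetalColor through PetalShape
by_name <- dplyr::select(toy, PetalColor:PetalShape)

# Positions work too: columns 2 and 3 are the same two columns
by_position <- dplyr::select(toy, 2:3)

# A minus sign drops a column and keeps the rest
without_file <- dplyr::select(toy, -File)

names(without_file)
```

Selecting by name is usually safer than selecting by position, because it keeps working if the column order changes.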


Get the data types:

- with `readr` do `spec` using `dataframe`

In [12]:
readr::spec(dataframe)

cols(
  File = [31mcol_character()[39m,
  PetalColor = [31mcol_character()[39m,
  PetalShape = [31mcol_character()[39m,
  Size = [31mcol_character()[39m
)

**QUESTION:**

What do the data types tell you about the variable types?

**ANSWER: (click here to edit)**

Every column is read as `character`, so the data types alone cannot tell us whether a variable is nominal or ordinal.
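One way to make the nominal/ordinal distinction explicit in R is to convert character values to factors. This is a minimal base-R sketch using made-up sample values rather than the real dataframe; the names `size_ordered` and `petal_color` are illustrative.

```r
# Hypothetical sample values mirroring the Size column
size <- c("medium", "medium", "large", "small")

# An ordered factor records that small < medium < large
size_ordered <- factor(size, levels = c("small", "medium", "large"), ordered = TRUE)

# Comparisons now respect the declared order
smaller_than_large <- size_ordered < "large"

# A plain (unordered) factor suits a nominal variable such as PetalColor
petal_color <- factor(c("multicolor", "unicolor"))

is.ordered(size_ordered)
is.ordered(petal_color)
```

With plain character values, a comparison such as `"small" < "medium"` would follow alphabetical order, which is not the intended size order.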

**QUESTION:**

Suppose the students who coded this data made mistakes with Size, so some `small` were coded as `medium` and some `medium` as `large`. 
Would that affect reliability, validity, or both?

**ANSWER: (click here to edit)**

It would only affect reliability, because `Size` would still be measured; it just wouldn't be measured as accurately.


**QUESTION:**

Suppose someone made a mistake preparing the data and typed a row like this:

`0100.png,,unicolor,rounded,medium`

What do you think would happen?

**ANSWER: (click here to edit)**

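The extra comma would add an empty field, shifting the color, shape, and size values one column to the right, so the row would have five fields instead of four.
Hopefully `readr` would report a problem to let us know this happened.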
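A malformed row like this can also be detected before loading by counting the fields on each line. The sketch below uses base R's `count.fields` on a few made-up CSV lines (they stand in for the real file) and is independent of whatever `readr` itself reports.

```r
# Hypothetical CSV lines; the last one has a stray extra comma
lines <- c(
  "File,PetalColor,PetalShape,Size",
  "0099.png,unicolor,rounded,medium",
  "0100.png,,unicolor,rounded,medium"
)

# Count the comma-separated fields on every line
con <- textConnection(lines)
field_counts <- count.fields(con, sep = ",")
close(con)

# Any line whose count differs from the header's is suspect
expected <- field_counts[1]
bad_lines <- which(field_counts != expected)

bad_lines
```

To check a real file, its path could be passed to `count.fields` in place of the text connection.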

