Copyright 2020 Andrew M. Olney and made available under [CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0) for text and [Apache-2.0](http://www.apache.org/licenses/LICENSE-2.0) for code.

# Data Science and the Nature of Data: Problem Solving

This notebook is a companion to [Data Science and the Nature of Data](Data-science-and-the-nature-of-data.ipynb) so please read that first.

## Problem

We will be working with [a set of flower images](https://drive.google.com/file/d/16OFwIazU-dnu27kzP08iuZvk_lnBK9Ak/view?usp=sharing).
As discussed in the last notebook, this is *unstructured data*.

A previous group has manually coded these data with three variables:

- PetalColor: unicolor or multicolor
- PetalShape: rounded or unrounded
- Size: small, medium, or large

In this session, you will load the data into a dataframe in Jupyter and manipulate its rows and columns.

## Load the data into a dataframe

Use [Data Science and the Nature of Data](Data-science-and-the-nature-of-data.ipynb) or the [Reference](Reference.ipynb) if you've forgotten any of these steps.

Import the `pandas` library, which lets us work with dataframes:

- `import pandas as pd`

In [3]:
import pandas as pd

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="i!#]:2XI=^qLb$e.|iwo">pd</variable></variables><block type="importAs" id="8u3elQqk_!6!WoHrlj}e" x="73" y="63"><field name="libraryName">pandas</field><field name="libraryAlias" id="i!#]:2XI=^qLb$e.|iwo">pd</field></block></xml>

Load a dataframe with the data in "datasets/flowers.csv" and display it:

- Set `dataframe` to `with pd do read_csv using "datasets/flowers.csv"`
- `dataframe` (to display)

In [4]:
dataframe = pd.read_csv('datasets/flowers.csv')

dataframe

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="t[n^Fcp7,s93E17ZZ9J6">dataframe</variable><variable id="i!#]:2XI=^qLb$e.|iwo">pd</variable></variables><block type="variables_set" id="YMNSvjU:9aS0`rADBobh" x="29" y="215"><field name="VAR" id="t[n^Fcp7,s93E17ZZ9J6">dataframe</field><value name="VALUE"><block type="varDoMethod" id="h[BIiU^0[[vbD`zoBn6+"><field name="VAR" id="i!#]:2XI=^qLb$e.|iwo">pd</field><field name="MEMBER">read_csv</field><data>pd:read_csv</data><value name="INPUT"><block type="text" id="HyH?(x3/MuPXE`T5;)[@"><field name="TEXT">datasets/flowers.csv</field></block></value></block></value></block><block type="variables_get" id="uS,Sc{|xLGBqCK`F3*-*" x="8" y="300"><field name="VAR" id="t[n^Fcp7,s93E17ZZ9J6">dataframe</field></block></xml>

Unnamed: 0,File,PetalColor,PetalShape,Size
0,0001.png,multicolor,rounded,medium
1,0002.png,unicolor,rounded,medium
2,0003.png,unicolor,unrounded,large
3,0004.png,multicolor,rounded,medium
4,0005.png,multicolor,rounded,small
...,...,...,...,...
205,0206.png,multicolor,rounded,large
206,0207.png,unicolor,rounded,large
207,0208.png,unicolor,unrounded,large
208,0209.png,multicolor,rounded,medium


**QUESTION:**

What are the variable types for each variable?

**ANSWER: (click here to edit)**

- PetalColor: nominal
- PetalShape: nominal
- Size: ordinal


Select the first 10 rows of the data:

- `in list dataframe get sub-list from first to 10`

In [5]:
dataframe[ : 10]

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="t[n^Fcp7,s93E17ZZ9J6">dataframe</variable></variables><block type="lists_getSublist" id="5hFiqAq|QPxp%Xc:h/@A" x="8" y="518"><mutation at1="false" at2="true"></mutation><field name="WHERE1">FIRST</field><field name="WHERE2">FROM_START</field><value name="LIST"><block type="variables_get" id="Y`nWHnCXvg4!GpZ![Z8S"><field name="VAR" id="t[n^Fcp7,s93E17ZZ9J6">dataframe</field></block></value><value name="AT2"><block type="math_number" id="D1[i#!$i}_h;`?.h;x*w"><field name="NUM">10</field></block></value></block></xml>

Unnamed: 0,File,PetalColor,PetalShape,Size
0,0001.png,multicolor,rounded,medium
1,0002.png,unicolor,rounded,medium
2,0003.png,unicolor,unrounded,large
3,0004.png,multicolor,rounded,medium
4,0005.png,multicolor,rounded,small
5,0006.png,unicolor,rounded,small
6,0007.png,unicolor,rounded,medium
7,0008.png,unicolor,rounded,medium
8,0009.png,unicolor,rounded,small
9,0010.png,unicolor,rounded,small


Select the last 10 rows of the data:

- `in list dataframe get sub-list from # from end 10 to last`

In [6]:
dataframe[-10 : ]

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="t[n^Fcp7,s93E17ZZ9J6">dataframe</variable></variables><block type="lists_getSublist" id="5hFiqAq|QPxp%Xc:h/@A" x="8" y="518"><mutation at1="true" at2="false"></mutation><field name="WHERE1">FROM_END</field><field name="WHERE2">LAST</field><value name="LIST"><block type="variables_get" id="Y`nWHnCXvg4!GpZ![Z8S"><field name="VAR" id="t[n^Fcp7,s93E17ZZ9J6">dataframe</field></block></value><value name="AT1"><block type="math_number" id="D1[i#!$i}_h;`?.h;x*w"><field name="NUM">10</field></block></value></block></xml>

Unnamed: 0,File,PetalColor,PetalShape,Size
200,0201.png,unicolor,rounded,medium
201,0202.png,multicolor,unrounded,large
202,0203.png,multicolor,unrounded,large
203,0204.png,unicolor,unrounded,small
204,0205.png,unicolor,unrounded,medium
205,0206.png,multicolor,rounded,large
206,0207.png,unicolor,rounded,large
207,0208.png,unicolor,unrounded,large
208,0209.png,multicolor,rounded,medium
209,0210.png,unicolor,rounded,small
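
The same first-and-last selections can also be written with `head` and `tail`, two pandas convenience methods. The sketch below uses a small stand-in dataframe (`demo`, a hypothetical name with made-up values) rather than `datasets/flowers.csv`, so it runs on its own.

```python
import pandas as pd

# Hypothetical stand-in data; not the flowers dataset
demo = pd.DataFrame({
    "File": [f"{i:04d}.png" for i in range(1, 31)],
    "Size": ["small", "medium", "large"] * 10,
})

first_ten = demo[:10]   # slice form used in this notebook
last_ten = demo[-10:]

# head/tail select the same rows as the slices above
same_head = first_ten.equals(demo.head(10))
same_tail = last_ten.equals(demo.tail(10))
print(same_head, same_tail)
```

The slice form matches what the Blockly sub-list block generates; `head` and `tail` are simply a more readable spelling for these two cases.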


Select the middle 10 rows of the data:

- `in list dataframe get sub-list from # 95 to # 104`

In [7]:
dataframe[94 : 104]

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="t[n^Fcp7,s93E17ZZ9J6">dataframe</variable></variables><block type="lists_getSublist" id="5hFiqAq|QPxp%Xc:h/@A" x="8" y="518"><mutation at1="true" at2="true"></mutation><field name="WHERE1">FROM_START</field><field name="WHERE2">FROM_START</field><value name="LIST"><block type="variables_get" id="Y`nWHnCXvg4!GpZ![Z8S"><field name="VAR" id="t[n^Fcp7,s93E17ZZ9J6">dataframe</field></block></value><value name="AT1"><block type="math_number" id="D1[i#!$i}_h;`?.h;x*w"><field name="NUM">95</field></block></value><value name="AT2"><block type="math_number" id=":W_T5/Ic4zuYR5*I6#^l"><field name="NUM">104</field></block></value></block></xml>

Unnamed: 0,File,PetalColor,PetalShape,Size
94,0095.png,unicolor,unrounded,small
95,0096.png,unicolor,rounded,medium
96,0097.png,unicolor,unrounded,medium
97,0098.png,multicolor,rounded,medium
98,0099.png,unicolor,rounded,medium
99,0100.png,unicolor,rounded,medium
100,0101.png,multicolor,unrounded,medium
101,0102.png,unicolor,unrounded,medium
102,0103.png,unicolor,unrounded,small
103,0104.png,unicolor,rounded,small
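
Python slices are half-open: `[start : stop]` includes position `start` and excludes position `stop`, so the number of rows returned is `stop - start`. A minimal sketch on a hypothetical stand-in dataframe (`demo`, with 210 rows like the flowers data) shows how to check how many rows a given pair of bounds selects:

```python
import pandas as pd

# Hypothetical stand-in: one column, 210 rows, default index 0..209
demo = pd.DataFrame({"row": range(210)})

ten_rows = demo[94:104]      # positions 94 through 103
eleven_rows = demo[94:105]   # positions 94 through 104

print(len(ten_rows), len(eleven_rows))  # prints: 10 11
```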


Select the columns for the first two coded variables:

- `dataframe[` using a list containing `"PetalColor"` and `"PetalShape"` `]`

In [8]:
dataframe[['PetalColor', 'PetalShape']]

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="t[n^Fcp7,s93E17ZZ9J6">dataframe</variable></variables><block type="indexer" id="oVNd/g7vyxV[^cAbT$JX" x="8" y="300"><field name="VAR" id="t[n^Fcp7,s93E17ZZ9J6">dataframe</field><value name="INDEX"><block type="lists_create_with" id="^oGt7#=i@OMTRwaf:kdd"><mutation items="2"></mutation><value name="ADD0"><block type="text" id="=k__Y{.gxL5z.AJpll.A"><field name="TEXT">PetalColor</field></block></value><value name="ADD1"><block type="text" id="uQyKHkAg(%zJh!AAX=[#"><field name="TEXT">PetalShape</field></block></value></block></value></block></xml>

Unnamed: 0,PetalColor,PetalShape
0,multicolor,rounded
1,unicolor,rounded
2,unicolor,unrounded
3,multicolor,rounded
4,multicolor,rounded
...,...,...
205,multicolor,rounded
206,unicolor,rounded
207,unicolor,unrounded
208,multicolor,rounded


Select the last two columns of the data:

- `dataframe[` using a list containing `"PetalShape"` and `"Size"` `]`

In [9]:
dataframe[['PetalShape', 'Size']]

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="t[n^Fcp7,s93E17ZZ9J6">dataframe</variable></variables><block type="indexer" id="oVNd/g7vyxV[^cAbT$JX" x="8" y="300"><field name="VAR" id="t[n^Fcp7,s93E17ZZ9J6">dataframe</field><value name="INDEX"><block type="lists_create_with" id="^oGt7#=i@OMTRwaf:kdd"><mutation items="2"></mutation><value name="ADD0"><block type="text" id="=k__Y{.gxL5z.AJpll.A"><field name="TEXT">PetalShape</field></block></value><value name="ADD1"><block type="text" id="uQyKHkAg(%zJh!AAX=[#"><field name="TEXT">Size</field></block></value></block></value></block></xml>

Unnamed: 0,PetalShape,Size
0,rounded,medium
1,rounded,medium
2,unrounded,large
3,rounded,medium
4,rounded,small
...,...,...
205,rounded,large
206,rounded,large
207,unrounded,large
208,rounded,medium


Select the column for the middle coded variable:

- `dataframe[` using a list containing `"PetalShape"` `]`

In [10]:
dataframe[['PetalShape']]

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="t[n^Fcp7,s93E17ZZ9J6">dataframe</variable></variables><block type="indexer" id="oVNd/g7vyxV[^cAbT$JX" x="8" y="300"><field name="VAR" id="t[n^Fcp7,s93E17ZZ9J6">dataframe</field><value name="INDEX"><block type="lists_create_with" id="^oGt7#=i@OMTRwaf:kdd"><mutation items="1"></mutation><value name="ADD0"><block type="text" id="=k__Y{.gxL5z.AJpll.A"><field name="TEXT">PetalShape</field></block></value></block></value></block></xml>

Unnamed: 0,PetalShape
0,rounded
1,rounded
2,unrounded
3,rounded
4,rounded
...,...
205,rounded
206,rounded
207,unrounded
208,rounded
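
Indexing with a list of names, as above, returns a dataframe even when the list holds a single name; indexing with a bare string returns a pandas `Series` instead. A small sketch of the difference, using a hypothetical stand-in dataframe (`demo`) with made-up rows:

```python
import pandas as pd

# Hypothetical stand-in data; not the flowers dataset
demo = pd.DataFrame({
    "PetalShape": ["rounded", "unrounded", "rounded"],
    "Size": ["medium", "large", "small"],
})

as_frame = demo[["PetalShape"]]   # list of names -> DataFrame with one column
as_series = demo["PetalShape"]    # single name   -> Series

print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__, as_series.shape)
```

The list form keeps the result two-dimensional, which is why the output above still displays as a table with a column header.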


Get the data types:

- `from dataframe get dtypes`

In [11]:
dataframe.dtypes

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="t[n^Fcp7,s93E17ZZ9J6">dataframe</variable></variables><block type="varGetProperty" id="BH7oEB_9wN^Cx==6^_[a" x="8" y="142"><field name="VAR" id="t[n^Fcp7,s93E17ZZ9J6">dataframe</field><field name="MEMBER">dtypes</field><data>dataframe:dtypes</data></block></xml>

File          object
PetalColor    object
PetalShape    object
Size          object
dtype: object

**QUESTION:**

What do the data types tell you about the variable types?

**ANSWER: (click here to edit)**

Nothing. Every column shows as `object`, so the data types do not distinguish nominal from ordinal variables.
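
If the measurement level matters for later analysis, it can be declared explicitly. The following is a minimal sketch, not part of the original exercise, that marks a `Size`-like column as an ordered categorical using `pd.CategoricalDtype`; the stand-in dataframe `demo` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in data; not the flowers dataset
demo = pd.DataFrame({"Size": ["medium", "large", "small", "medium"]})

# Declare the allowed values and their order: small < medium < large
size_type = pd.CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
demo["Size"] = demo["Size"].astype(size_type)

print(demo["Size"].dtype)
# Ordered categoricals support min/max and comparisons
print(demo["Size"].min(), demo["Size"].max())
larger_than_small = (demo["Size"] > "small").sum()
print(larger_than_small)
```

An unordered categorical (`ordered=False`) would be the analogous choice for nominal variables such as `PetalColor` and `PetalShape`.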


**QUESTION:**

Suppose the students who coded this data made mistakes with Size, so some `small` were coded as `medium` and some `medium` as `large`. 
Would that affect reliability, validity, or both?

**ANSWER: (click here to edit)**

It would affect only reliability: `Size` would still be measured, just less accurately.


**QUESTION:**

Suppose someone made a mistake preparing the data and typed a row like this:

`0100.png,,unicolor,rounded,medium`

What do you think would happen?

**ANSWER: (click here to edit)**

The extra comma would shift the color, shape, and size values one column to the right.
Hopefully `pandas` would throw an error to let us know this happened.
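
This guess can be checked without touching the real file. The sketch below feeds `read_csv` a small in-memory CSV whose header has the four columns reported by `dtypes` and whose second data row carries the extra comma; the first data row is made up for illustration. With pandas' default settings, a row with more fields than expected raises `pandas.errors.ParserError`, though the exact message can vary between versions.

```python
import io
import pandas as pd

csv_text = (
    "File,PetalColor,PetalShape,Size\n"
    "0099.png,unicolor,rounded,medium\n"     # made-up well-formed row
    "0100.png,,unicolor,rounded,medium\n"    # the malformed row from the question
)

try:
    pd.read_csv(io.StringIO(csv_text))
    raised = False
except pd.errors.ParserError as err:
    raised = True
    print("pandas reported:", err)
```

Note that this assumes the malformed row is not the first data row; pandas treats an extra leading field on every row differently, as an implicit index column.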
