Copyright 2020 Andrew M. Olney and made available under [CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0) for text and [Apache-2.0](http://www.apache.org/licenses/LICENSE-2.0) for code.

# Data Science and the Nature of Data: Problem Solving

This notebook is a companion to [Data Science and the Nature of Data](Data-science-and-the-nature-of-data.ipynb), so please read that first.

## Flower data

We will first be working with [a set of flower images](https://drive.google.com/file/d/16OFwIazU-dnu27kzP08iuZvk_lnBK9Ak/view?usp=sharing).

A previous group has manually coded these data with three variables:

- PetalColor: unicolor or multicolor
- PetalShape: rounded or unrounded
- Size: small, medium, or large

Let's begin by loading the data and inspecting the types of variables.

Use [Data Science and the Nature of Data](Data-science-and-the-nature-of-data.ipynb) or the [Reference](Reference.ipynb) if you've forgotten any of these steps.

Import the `pandas` library, which lets us work with dataframes.

In [5]:
import pandas as pd

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="LXC0-aK4$bR7Ebrt}Q[j">&lt;select&gt;</variable></variables><block type="importAs" id="8u3elQqk_!6!WoHrlj}e" x="73" y="63"><field name="libraryName">pandas</field><field name="VAR" id="LXC0-aK4$bR7Ebrt}Q[j">&lt;select&gt;</field></block></xml>

Load a dataframe with the data in "datasets/flowers.csv" and display it.

In [7]:
dataframe = pd.read_csv('datasets/flowers.csv')

dataframe

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="t[n^Fcp7,s93E17ZZ9J6">dataframe</variable><variable id="i!#]:2XI=^qLb$e.|iwo">pd</variable></variables><block type="variables_set" id="3v`CGfKaBAQlZxNrLh;g" x="93" y="206"><field name="VAR" id="t[n^Fcp7,s93E17ZZ9J6">dataframe</field><value name="VALUE"><block type="varDoMethod" id="_t%9/`1H3Fc{hR:y|JKO"><mutation items="1"></mutation><field name="VAR" id="i!#]:2XI=^qLb$e.|iwo">pd</field><field name="MEMBER">read_csv</field><data>pd:read_csv</data><value name="ADD0"><block type="text" id="Ac(8:^_P3%XN~/eCvim3"><field name="TEXT">datasets/flowers.csv</field></block></value></block></value></block><block type="variables_get" id="#KxUvi=Dm$}ah[(R}s(A" x="98" y="286"><field name="VAR" id="t[n^Fcp7,s93E17ZZ9J6">dataframe</field></block></xml>

|  | File | PetalColor | PetalShape | Size |
|---|---|---|---|---|
| 0 | 0001.png | multicolor | rounded | medium |
| 1 | 0002.png | unicolor | rounded | medium |
| 2 | 0003.png | unicolor | unrounded | large |
| 3 | 0004.png | multicolor | rounded | medium |
| 4 | 0005.png | multicolor | rounded | small |
| ... | ... | ... | ... | ... |
| 205 | 0206.png | multicolor | rounded | large |
| 206 | 0207.png | unicolor | rounded | large |
| 207 | 0208.png | unicolor | unrounded | large |
| 208 | 0209.png | multicolor | rounded | medium |


**QUESTION:**

What are the variable types for each variable?

**ANSWER: (click here to edit)**


<hr>

Get the `dtypes` from the dataframe.

In [10]:
dataframe.dtypes

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="t[n^Fcp7,s93E17ZZ9J6">dataframe</variable></variables><block type="varGetProperty" id="BH7oEB_9wN^Cx==6^_[a" x="8" y="142"><field name="VAR" id="t[n^Fcp7,s93E17ZZ9J6">dataframe</field><field name="MEMBER">dtypes</field><data>dataframe:dtypes</data></block></xml>

File          object
PetalColor    object
PetalShape    object
Size          object
dtype: object

**QUESTION:**

What do the data types tell you about the variable types and why?

**ANSWER: (click here to edit)**


<hr>

**QUESTION:**

Suppose the students who coded these data made mistakes with Size, so that some `small` flowers were coded as `medium` and some `medium` flowers as `large`.
Would that affect reliability, validity, or both, and why?

**ANSWER: (click here to edit)**


<hr>

**QUESTION:**

Suppose someone made a mistake preparing the data and typed a row like this:

`0100.png,,unicolor,rounded,medium`

What do you think would happen and why?

**ANSWER: (click here to edit)**


<hr>

Use `describe` on the dataframe.

In [12]:
dataframe.describe(include='all')

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="t[n^Fcp7,s93E17ZZ9J6">dataframe</variable></variables><block type="varDoMethod" id="PgC]TrEuai$*N/:Fl@hq" x="8" y="176"><mutation items="1"></mutation><field name="VAR" id="t[n^Fcp7,s93E17ZZ9J6">dataframe</field><field name="MEMBER">describe</field><data>dataframe:describe</data><value name="ADD0"><block type="dummyOutputCodeBlock" id="gCh37Jr:x/}|2.Jh~0J1"><field name="CODE">include='all'</field></block></value></block></xml>

|  | File | PetalColor | PetalShape | Size |
|---|---|---|---|---|
| count | 210 | 210 | 210 | 210 |
| unique | 210 | 2 | 2 | 3 |
| top | 0001.png | unicolor | unrounded | medium |
| freq | 1 | 129 | 107 | 93 |


**QUESTION:**

What is the most frequent petal color?

**ANSWER: (click here to edit)**


<hr>

## Flower data - dirty version

Now let's look at a messed-up version of the same data.

Your job is to figure out the problems!

Start by loading "datasets/flowers-dirty.csv" into the dataframe.

In [14]:
dataframe = pd.read_csv('datasets/flowers-dirty.csv')

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="t[n^Fcp7,s93E17ZZ9J6">dataframe</variable><variable id="i!#]:2XI=^qLb$e.|iwo">pd</variable></variables><block type="variables_set" id="3v`CGfKaBAQlZxNrLh;g" x="93" y="206"><field name="VAR" id="t[n^Fcp7,s93E17ZZ9J6">dataframe</field><value name="VALUE"><block type="varDoMethod" id="_t%9/`1H3Fc{hR:y|JKO"><mutation items="1"></mutation><field name="VAR" id="i!#]:2XI=^qLb$e.|iwo">pd</field><field name="MEMBER">read_csv</field><data>pd:read_csv</data><value name="ADD0"><block type="text" id="Ac(8:^_P3%XN~/eCvim3"><field name="TEXT">datasets/flowers-dirty.csv</field></block></value></block></value></block></xml>

Now describe it and compare to the original above.

In [16]:
dataframe.describe(include='all')

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="t[n^Fcp7,s93E17ZZ9J6">dataframe</variable></variables><block type="varDoMethod" id="PgC]TrEuai$*N/:Fl@hq" x="8" y="176"><mutation items="1"></mutation><field name="VAR" id="t[n^Fcp7,s93E17ZZ9J6">dataframe</field><field name="MEMBER">describe</field><data>dataframe:describe</data><value name="ADD0"><block type="dummyOutputCodeBlock" id="gCh37Jr:x/}|2.Jh~0J1"><field name="CODE">include='all'</field></block></value></block></xml>

|  | File | PetalColor | PetalShape | Size |
|---|---|---|---|---|
| count | 210 | 210 | 209 | 210 |
| unique | 210 | 3 | 2 | 3 |
| top | 0001.png | unicolor | unrounded | medium |
| freq | 1 | 129 | 129 | 93 |


**QUESTION:**

What problems do you see?

**ANSWER: (click here to edit)**

- Petal shape count is one less than the others.
- Petal color unique is one more than it should be.
- Petal color and petal shape have the same `freq` for their top category.
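
The comparison above was done by eye. As a rough sketch, the two `describe` summaries could also be compared in code. The small `clean` and `dirty` dataframes below are made-up stand-ins for the two CSV files, not the real flower data, and the variable names are only for illustration.

```python
import pandas as pd

# Hypothetical miniature stand-ins for flowers.csv and flowers-dirty.csv
clean = pd.DataFrame({
    'PetalColor': ['multicolor', 'unicolor', 'unicolor', 'multicolor', 'unicolor'],
    'PetalShape': ['rounded', 'rounded', 'unrounded', 'rounded', 'unrounded'],
})
dirty = pd.DataFrame({
    # one misspelled label
    'PetalColor': ['multicolor', 'unicolor', 'unicolor', 'multcolor', 'unicolor'],
    # one missing value
    'PetalShape': ['rounded', 'rounded', 'unrounded', 'rounded', None],
})

clean_summary = clean.describe(include='all')
dirty_summary = dirty.describe(include='all')

# True wherever a summary statistic differs between the two versions
changed = clean_summary.ne(dirty_summary)
changed
```

In this toy example the misspelling shows up as a change in `unique` for PetalColor, and the missing value shows up as a change in `count` for PetalShape.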
  
<hr>

### Missing values

Use code to confirm the missing value.
First, check which values are missing.

In [18]:
missing = dataframe.isnull()

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="Y[%k.R#:6/St%z5mke*3">missing</variable><variable id="t[n^Fcp7,s93E17ZZ9J6">dataframe</variable></variables><block type="variables_set" id="[cvG~Q`fmxO{{q)1VM4?" x="4" y="183"><field name="VAR" id="Y[%k.R#:6/St%z5mke*3">missing</field><value name="VALUE"><block type="varDoMethod" id="#coK8HQ6wgNvZ5{1HL){"><mutation items="1"></mutation><field name="VAR" id="t[n^Fcp7,s93E17ZZ9J6">dataframe</field><field name="MEMBER">isnull</field><data>dataframe:isnull</data></block></value></block></xml>

Now sum the missing values in each column.

In [20]:
missing.sum()

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="Y[%k.R#:6/St%z5mke*3">missing</variable></variables><block type="varDoMethod" id="r*8K.}/8#$[x*@V/qN:g" x="8" y="176"><mutation items="1"></mutation><field name="VAR" id="Y[%k.R#:6/St%z5mke*3">missing</field><field name="MEMBER">sum</field><data>missing:sum</data></block></xml>

File          0
PetalColor    0
PetalShape    1
Size          0
dtype: int64
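
The column sums show how many values are missing, but not where they are. A minimal sketch of finding the affected row follows. It uses a small made-up dataframe in place of `datasets/flowers-dirty.csv`, so the file names and values are illustrative assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the dirty flower data
dataframe = pd.DataFrame({
    'File': ['0001.png', '0002.png', '0003.png'],
    'PetalShape': ['rounded', None, 'unrounded'],
})

# Boolean mask: True for rows where PetalShape is missing
shape_missing = dataframe['PetalShape'].isnull()

# Keep only the rows where the mask is True
rows_with_missing = dataframe[shape_missing]
rows_with_missing
```

Indexing a dataframe with a boolean series keeps only the rows marked `True`, which is why the result contains just the row with the missing shape.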

### Extra values

First, get the column you want to check.

In [23]:
pc = dataframe['PetalColor']

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="0MvLf}O4*leP;K0x-s,?">pc</variable><variable id="t[n^Fcp7,s93E17ZZ9J6">dataframe</variable></variables><block type="variables_set" id="egQSB$%{GFk|}me18P|2" x="43" y="111"><field name="VAR" id="0MvLf}O4*leP;K0x-s,?">pc</field><value name="VALUE"><block type="indexer" id="Dbka$JDA,nFAp|kiO@Ld"><field name="VAR" id="t[n^Fcp7,s93E17ZZ9J6">dataframe</field><value name="INDEX"><block type="text" id="d6|Y|4N8Ai[C#mAxGHH;"><field name="TEXT">PetalColor</field></block></value></block></value></block></xml>

Now get the unique values in that column.

In [25]:
pc.unique()

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="0MvLf}O4*leP;K0x-s,?">pc</variable></variables><block type="varDoMethod" id="1WVS^M{Q6B04+R4eW@kZ" x="8" y="176"><mutation items="1"></mutation><field name="VAR" id="0MvLf}O4*leP;K0x-s,?">pc</field><field name="MEMBER">unique</field><data>pc:unique</data></block></xml>

array(['multicolor', 'unicolor', 'multcolor'], dtype=object)

**QUESTION:**

What problem do you see, and how would you fix it?

**ANSWER: (click here to edit)**

- Petal color has a misspelling of multicolor (`multcolor`).
- We could fix it by editing the CSV.
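
Editing the CSV by hand is one option. Another, sketched here under the assumption that `multcolor` is the only misspelling, is to correct the value in the dataframe with `replace`. The short `pc` series below is a made-up stand-in for the real column, and `pc_fixed` is a name chosen for illustration.

```python
import pandas as pd

# Hypothetical stand-in for dataframe['PetalColor']
pc = pd.Series(['multicolor', 'unicolor', 'multcolor'], name='PetalColor')

# Map the misspelled label to the intended one; other values are unchanged
pc_fixed = pc.replace({'multcolor': 'multicolor'})

pc_fixed.unique()
```

Fixing the value in code leaves the original file untouched and records the correction as a repeatable step.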
  
<hr>

### Duplicate variables

Compare the distributions of the two similar variables to see if they match.

Start by getting the first distribution.

In [27]:
pc.value_counts()

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="0MvLf}O4*leP;K0x-s,?">pc</variable></variables><block type="varDoMethod" id="FPi518`a/Z$wu`S`yb7p" x="200" y="315"><mutation items="1"></mutation><field name="VAR" id="0MvLf}O4*leP;K0x-s,?">pc</field><field name="MEMBER">value_counts</field><data>pc:value_counts</data></block></xml>

PetalColor
unicolor      129
multicolor     80
multcolor       1
Name: count, dtype: int64

To get the same for the other variable, we must first get that column.

In [29]:
ps = dataframe['PetalShape']

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="xa3n`IM3Y^5L~Dn7ycGd">ps</variable><variable id="t[n^Fcp7,s93E17ZZ9J6">dataframe</variable></variables><block type="variables_set" id="jN@/rW]R%36py}xWqx7t" x="30" y="110"><field name="VAR" id="xa3n`IM3Y^5L~Dn7ycGd">ps</field><value name="VALUE"><block type="indexer" id="O[YQ?-8mYMC?wAT4g-P5"><field name="VAR" id="t[n^Fcp7,s93E17ZZ9J6">dataframe</field><value name="INDEX"><block type="text" id="Js]y_egVIzA/TI:c[R6e"><field name="TEXT">PetalShape</field></block></value></block></value></block></xml>

Now get the distribution.

In [33]:
ps.value_counts()

#<xml xmlns="https://developers.google.com/blockly/xml"><variables><variable id="xa3n`IM3Y^5L~Dn7ycGd">ps</variable></variables><block type="varDoMethod" id="v:n[qOBvESROb]wBs;c6" x="200" y="315"><mutation items="1"></mutation><field name="VAR" id="xa3n`IM3Y^5L~Dn7ycGd">ps</field><field name="MEMBER">value_counts</field><data>ps:value_counts</data></block></xml>

PetalShape
unrounded    129
rounded       80
Name: count, dtype: int64

**QUESTION:**

How are these variables different? Should we consider them the same and why?

**ANSWER: (click here to edit)**


- Petal color has a misspelling of multicolor
- Petal shape is missing a value
- Otherwise the distributions are the same
- We might consider them the same because they are so closely aligned with each other; we will explore this idea more in a future notebook
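
Matching counts do not by themselves show that the two variables agree flower by flower. One way to check, offered as a sketch rather than as the approach the future notebook takes, is a cross-tabulation with `pd.crosstab`. The small dataframe below is invented for illustration.

```python
import pandas as pd

# Hypothetical rows; the real data would come from the CSV
dataframe = pd.DataFrame({
    'PetalColor': ['unicolor', 'unicolor', 'multicolor', 'multicolor', 'unicolor'],
    'PetalShape': ['unrounded', 'unrounded', 'rounded', 'rounded', 'rounded'],
})

# Rows are PetalColor values, columns are PetalShape values,
# and each cell counts the flowers with that combination
table = pd.crosstab(dataframe['PetalColor'], dataframe['PetalShape'])
table
```

If the two variables were duplicates of each other, each row of the table would have all of its count in a single column. In this toy example one unicolor flower is rounded, so the two variables are not identical.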

<hr>
