> All content here is under a Creative Commons Attribution [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/) and all source code is released under a [BSD-2 clause license](https://en.wikipedia.org/wiki/BSD_licenses). 
>
>and inspect the values of ``counts`` and ``bins`` that you get as output. What do you think these are? Compare them to the plots above.

In [None]:
# Try the above code here, and interpret the output

### Step 6: Communicate your results

We wrap up our workflow with the final step: communicating the answer to our goal. What does the spread of the data look like?

We won't provide a definitive answer. Instead, we give some phrases and sentences below that you might use if you had to write this up in a report. ***Which of these are correct***, and ***which are incorrect***?

* The grades of the students are uniformly distributed (spread).
* The time taken is approximately symmetrically spread, with a center of 190 minutes.
* There is a long tail observed in the student grades, with a <a href="https://www.itl.nist.gov/div898/handbook/eda/section3/histogr6.htm" target="_blank">tail (skew) to the right</a>.
* There is a long tail observed in the student grades, with a <a href="https://www.itl.nist.gov/div898/handbook/eda/section3/histogr7.htm" target="_blank">tail (skew) to the left</a>.
* There appear to be outliers in the `Time` taken variable, with some students taking an exceptionally short time.
* The median grade seems to be between 70 and 75%, and matches the value in the ``describe()`` table.

### Summary

From the above exercise you should see the value of a histogram. New terminology is ***highlighted***.

<img src="images/general/Crystal_Clear_app_korganizer.png" style="width: 100px ; float:right"/>

* Histograms are a graphic summary of the spread of a single variable. We sometimes use the word "***distribution***" instead of "spread". The word ***scale*** is also used by statisticians for this concept.
* We get a good idea of the center of the data, also called the ***location***.
* We can, depending on the number of bins, detect if there are outliers.
* They indicate if there is skew in the data: does the histogram have a tail to one side?
* We also see how many 'humps' there are in the data. Is everything collected in one hump, or are there two humps (peaks)? This <a href="https://www.itl.nist.gov/div898/handbook/eda/section3/eda33e4.htm" target="_blank">webpage</a> shows an example of a distribution with two peaks.
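
The ideas of location and skew can also be checked with numbers, not only by eye. The sketch below is our own illustration: the data are made up (values drawn from an exponential distribution, which has a long tail to the right), and the variable names are hypothetical. It compares the median with the mean and asks Pandas for the skewness.

```python
import numpy as np
import pandas as pd

# Made-up data with a long tail to the right (NOT the student data set)
rng = np.random.default_rng(0)
values = pd.Series(rng.exponential(scale=10, size=1000))

# Location: the median is a robust measure of the center
print("Median: {:.2f}".format(values.median()))

# With a tail to the right, the mean is pulled above the median
print("Mean:   {:.2f}".format(values.mean()))

# Positive skewness indicates a tail to the right; negative, to the left
print("Skew:   {:.2f}".format(values.skew()))
```

If you plot `values.hist()` afterwards, the long tail to the right in the histogram should agree with the positive skewness value.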

#### ➜ Challenge yourself: Random walks

Imagine a person walking. Every step forward also includes a small random amount to the left (negative values) or right (positive values).

We can model these values with numbers from a normal distribution that is centered at zero. If they walk in a perfectly straight line then, viewed from the back, their position stays at zero.

If they have had too much to drink, their steps might deviate a bit more. We can increase the standard deviation of the normal distribution to make the distribution wider.

In [31]:
from scipy.stats import norm

# 20 steps for a regular person, showing the deviation to the
# left (negative) or to the right (positive) when they are 
# walking straight. Values are in centimeters.
regular_steps = norm.rvs(loc=0, scale=5, size = 20)
print('Regular walking: \n{}'.format(regular_steps))

# Consumed too much? Standard deviation (scale) is larger:
deviating_steps = norm.rvs(loc=0, scale=12, size = 20)
print('Someone who has consumed too much: \n{}'.format(deviating_steps))

Regular walking: 
[ 3.36872552e+00 -1.13917806e+01  3.22569416e+00 -9.77637798e-01
  4.08921191e+00  7.35784690e-01 -2.90071068e+00  3.75347963e+00
 -2.45082272e+00  7.49820386e+00 -4.93361234e+00  3.48327341e+00
  1.13822432e-02 -6.85604873e+00 -8.68931720e+00 -3.41468947e-01
  2.16465212e+00  6.57643141e-01 -7.34073427e-01 -1.33100624e-01]
Someone who has consumed too much: 
[ 19.65502962   1.39951335  32.73139175   7.04282902  -3.19859331
 -20.33808233   9.63950154   5.80177213   0.9900464    1.92216785
 -20.843857   -18.43692294 -11.35410504   7.06983521   8.08637714
 -24.71550656   8.58118158   7.17017213  -7.92971946  16.67303932]
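
Because the values are random, your numbers will differ from the output above each time you run the cell. If you want repeatable results, one option is the `random_state` argument of `norm.rvs`. This is a minimal sketch; the seed value `42` is an arbitrary choice of ours.

```python
from scipy.stats import norm

# Using the same random_state (seed) twice gives exactly the same 20 values
first = norm.rvs(loc=0, scale=5, size=20, random_state=42)
second = norm.rvs(loc=0, scale=5, size=20, random_state=42)

# Prints True: the two sets of steps are identical
print((first == second).all())
```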


##### Questions

1. Visualize the histogram of 1000 steps of someone who is walking *normally* 😃 
2. Visualize, in a subplot, side-by-side, the histogram of someone who has consumed too much.

Both histograms should be centered at zero. Give each histogram a title, and a label on the x-axis, including units of centimeters.

***Hint*** To create a Pandas series of the values, remember [from worksheet 7](https://yint.org/pybasic07) that you can do that as follows:
```python
import pandas as pd
steps = pd.Series(data = ...)
```

In [43]:
# Put your code here

A person walking in a random way has a cumulative effect. If they have deviated 30cm to the left, they are at `-30`; if their next step sways to the right by 10cm, then they will be at `-20`.

Modify your code to show the histogram of the cumulative sum of the deviations! You only need to make a very small modification to do this - thank you Pandas!
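
To check the arithmetic of the two-step example above, here is a small sketch that uses only those two deviations. It assumes the Pandas `cumsum()` method is the tool you want; try it on your own series of 1000 steps afterwards.

```python
import pandas as pd

# Sideways deviation of each step, in centimeters (the two steps from the text)
deviations = pd.Series([-30, 10])

# cumsum() adds the deviations one after the other,
# giving the position after each step
position = deviations.cumsum()
print(position.tolist())  # [-30, -20]
```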

#### ➜  Challenge yourself:  Can you see normal distributions using a histogram?

Many statistical tools require the data to be normally distributed. 

Novices fall into the trap of plotting the histogram, saying that it looks normally distributed, and then continuing with their next steps. There is a better way to test this, which is shown in [the next module](https://yint.org/pybasic10).

But for now, try to run this code, to convince yourself that histograms are not a great tool to visualize if data are normally distributed. 
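
The sketch below is one possible way to run this experiment. It is our own illustration, not necessarily the code originally intended here: the number of samples (6), the sample size (30), the number of bins (8) and the seeds are all assumptions. Every sample comes from the same normal distribution, yet with so few values the histograms may look quite different from each other, and several may not look bell-shaped at all.

```python
import matplotlib.pyplot as plt
from scipy.stats import norm

sample_size = 30  # assumed: a small sample, as is common in practice

# A 2 x 3 grid of histograms, all sharing the same x-axis
fig, axes = plt.subplots(2, 3, figsize=(10, 5), sharex=True)

for i, ax in enumerate(axes.flat):
    # Each sample is drawn from the SAME normal distribution;
    # only the seed (random_state) differs between panels
    sample = norm.rvs(loc=0, scale=1, size=sample_size, random_state=i)
    ax.hist(sample, bins=8)
    ax.set_title("Sample {}".format(i + 1))

fig.suptitle("Six samples of 30 values, all from a normal distribution")
```

Try changing `sample_size` to `3000` and see how the histograms change.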

***After years of experience working with data, you will find your own approach.***

Here is my 6-step approach (it is not linear, but iterative): **Define**, **Get**, **Explore**, **Clean**, **Manipulate**, **Communicate**

1. **Define**/clarify the *objective*. Write down exactly what you need to deliver to have the project/assignment considered as completed. Then your next steps become clear.

2. Look for and **get** your data (or it will be given to you by a colleague). Since you have your objective clarified, it is clearer now which data, and how much data you need.

3. Then start looking at the data. Are the data what we expect? This is the **explore** step. Use plots and table summaries.

4. **Clean** up your data. This step and the prior step are iterative. As you explore your data you notice problems, bad data, you ask questions, you gain a bit of insight into the data. You clean, and re-explore, but always with the goal(s) in mind. Or perhaps you realize already this isn't the right data to reach your objective. You need other data, so you iterate.

5. Modify, make calculations from, and **manipulate** the data. This step is also called modeling, if you are building models, but sometimes you are simply summarizing your data to get the objective solved.

6. From the data models and summaries and plots you start extracting the insights and conclusions you were looking for. Again, you can go back to any of the prior steps if you realize you need that to better achieve your goal(s). You **communicate** clear visualizations to your colleagues, with crisp, short text explanations that meet the objectives.
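
As a rough illustration, the six steps could look as follows in code. Everything here is hypothetical: the objective, the column name `Grade`, and the tiny made-up table that stands in for real data. In practice each step is far more involved, and you will jump back and forth between them.

```python
import pandas as pd

# 1. Define: (hypothetical objective) report the typical grade of a class.

# 2. Get: a made-up table stands in for data from a file or a colleague.
raw = pd.DataFrame({"Grade": [72.0, 65.0, None, 88.0, 740.0, 70.0]})

# 3. Explore: the summary table reveals a missing value and a suspicious 740.
print(raw.describe())

# 4. Clean: drop the missing entry and any grade outside the range 0 to 100.
clean = raw.dropna()
clean = clean[clean["Grade"].between(0, 100)]

# 5. Manipulate: calculate the summary that answers the objective.
median_grade = clean["Grade"].median()

# 6. Communicate: state the result in one short sentence.
print("The median grade is {:.0f}%.".format(median_grade))
```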

___

The above workflow (also called a '*pipeline*') is not new or unique to this course. Other people have written about similar approaches:

* Garrett Grolemund and Hadley Wickham in their book on <a href="http://r4ds.had.co.nz/index.html" target="_blank">R for Data Science</a> have this diagram (from <a href="http://r4ds.had.co.nz/explore-intro.html" target="_blank">this part</a> of their book). It matches the above, with slightly different names for the steps. It misses, in my opinion, the most important step of ***defining your goal*** first.
<img src="images/general/data-science-explore--Wickham-and-Grolemund-book.png">

___
* Hilary Mason and Chris Wiggins in their article on <a href="http://www.dataists.com/2010/09/a-taxonomy-of-data-science/" target="_blank">A Taxonomy of Data Science</a> describe their 5 steps in detail:
 1. **Obtain**: pointing and clicking does not scale. In other words, pointing and clicking in Excel, Minitab, or similar software is OK for small data/quick analysis, but does not scale to large data, nor repeated data analysis. 
 1. **Scrub**: the world is a messy place
 1. **Explore**: you can see a lot by looking
 1. **Models**: always bad, sometimes ugly
 1. **Interpret**: "the purpose of computing is insight, not numbers."
 
 You can read their article, as well as <a href="https://towardsdatascience.com/a-beginners-guide-to-the-data-science-pipeline-a4904b2d8ad3" target="_blank">this view on it</a>, which is a bit more lighthearted.
 
___

What has been your approach so far?

>***Feedback and comments about this worksheet?***
> Please provide any anonymous [comments, feedback and tips](https://docs.google.com/forms/d/1Fpo0q7uGLcM6xcLRyp4qw1mZ0_igSUEnJV6ZGbpG4C4/edit).

In [11]:
# IGNORE this. Execute this cell to load the notebook's style sheet.
from IPython.display import HTML
css_file = './images/style.css'
# Read the style sheet and close the file again before displaying it
with open(css_file, "r") as f:
    css = f.read()
HTML(css)