> All content here is under a Creative Commons Attribution [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/) and all source code is released under a [BSD-2 clause license](https://en.wikipedia.org/wiki/BSD_licenses). 
>
>Please reuse, remix, revise, and [reshare this content](https://github.com/kgdunn/python-basic-notebooks) in any way, keeping this notice.

# Module 9: Overview 

In the prior [module 8](https://yint.org/pybasic08) you got more exposure to Pandas data frames.

In this module we use these data frames to get a brief exposure to **statistics** and **plotting**. We could look at each topic separately, but they go hand-in-hand. You've probably heard: "*always start your data analysis by plotting your data*". There's a good reason for that: the type of statistical analysis is guided by what is in the data, and plotting the data is one of the most effective ways to figure that out.

<hr>
<img src="images/general/Crystal_Clear_action_db_commit.png" style="width: 100px ; float:right"/> Check out this repo using Git. Use your favourite Git user-interface, or work at the command line:

>```
>git clone git@github.com:kgdunn/python-basic-notebooks.git
>
># If you already have the repo cloned:
>git pull
>```

Running `git pull` updates your copy to the latest version.



### Preparing for this module

You should have read [Chapter 1](https://learnche.org/pid/data-visualization/) of the book "Process Improvement using Data".

### Summarizing data visually and numerically (statistics)

Across two modules, this module 9 and module 10, we will cover these topics:

1. Box plots
2. Bar plots (bar charts) 
3. Histograms
4. Data tables
5. Time-series, or sequence plots
6. Scatter plots

In between, throughout the notes, we will also introduce statistical and data science concepts. This way you will learn how to interpret the plots and also communicate your results with the correct language.

## A general work flow for any project where you deal with data

After years of experience working with data, you will find your own approach.

Here is my 6-step approach (not linear, but iterative): **Define**, **Get**, **Explore**, **Clean**, **Manipulate**, **Communicate**

1. **Define**/clarify the objective. Write down exactly what you need to deliver to have the project/assignment considered as completed. Then your next steps become clear.

2. Look for and **get** your data (or it will be given to you by a colleague). Since you have clarified your objective, it is now clearer what data, and how much data, you need.

3. Then start looking at the data. Are the data what we expect? This is the **explore** step. Use plots and summaries.

4. **Clean** up your data. This step and the prior step are iterative. As you explore your data you notice problems such as bad data entry, you ask questions, and you gain a bit of insight into the data. You clean, and re-explore, but always with the goal(s) in mind. Or perhaps you realize already that this isn't the right data to reach your objective.

5. **Manipulate** the data: modify it and make calculations from it. This step is also called modeling, if you are building models, but sometimes you are simply summarizing your data.

6. From the data models and summaries and plots you start extracting the insights and conclusions you were looking for. Again, you can go back to any of the prior steps if you realize you need that to better achieve your goal(s). You **communicate** clear visualizations to your colleagues, with crisp, short explanations that meet the objectives.
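
The six steps above can be sketched in a few lines of Pandas. This is only an illustration, not a prescribed method: the objective, the column names `reactor` and `temperature`, the values, and the idea that `-999` marks a missing reading are all made-up assumptions for this example.

```python
import numpy as np
import pandas as pd

# 1. Define: (hypothetical objective) report the average temperature per reactor.

# 2. Get: a small made-up data set stands in for data received from a colleague.
raw = pd.DataFrame({
    "reactor": ["A", "A", "B", "B", "B", "A"],
    "temperature": [71.2, 69.8, 75.1, -999.0, 74.3, np.nan],
})

# 3. Explore: a numeric summary (and, later in this module, plots).
summary = raw["temperature"].describe()
print(summary)  # the minimum of -999 looks suspicious

# 4. Clean: ASSUME -999 is a missing-value code, then drop the missing rows.
clean = raw.copy()
clean["temperature"] = clean["temperature"].replace(-999.0, np.nan)
clean = clean.dropna(subset=["temperature"])

# 5. Manipulate: calculate the summary that the objective asks for.
per_reactor = clean.groupby("reactor")["temperature"].mean()

# 6. Communicate: a short, clear result for your colleagues.
print(per_reactor.round(2))
```

In a real project you would loop back: the suspicious minimum found in step 3 is what prompts the cleaning in step 4, after which you explore again.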

___

The above work flow (also called a '*pipeline*') is not new or unique to this course. Other people have written about similar approaches:

* Garrett Grolemund and Hadley Wickham in their book on <a href="http://r4ds.had.co.nz/index.html" target="_blank">R for Data Science</a> have this diagram (from <a href="http://r4ds.had.co.nz/explore-intro.html" target="_blank">this part</a> of their book):
<img src="images/general/data-science-explore--Wickham-and-Grolemund-book.png">

___
* Hilary Mason and Chris Wiggins in their article on <a href="http://www.dataists.com/2010/09/a-taxonomy-of-data-science/" target="_blank">A Taxonomy of Data Science</a> describe their 5 steps in detail:
 1. **Obtain**: pointing and clicking does not scale.
 1. **Scrub**: the world is a messy place
 1. **Explore**: you can see a lot by looking
 1. **Models**: always bad, sometimes ugly
 1. **Interpret**: "the purpose of computing is insight, not numbers."
 
 You can read their article, as well as <a href="https://towardsdatascience.com/a-beginners-guide-to-the-data-science-pipeline-a4904b2d8ad3" target="_blank">this view on it</a>, which is a bit more lighthearted.
 
___

What has been your approach so far?

>***Feedback and comments about this worksheet?***
> Please provide any anonymous [comments, feedback and tips](https://docs.google.com/forms/d/1Fpo0q7uGLcM6xcLRyp4qw1mZ0_igSUEnJV6ZGbpG4C4/edit).

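# IGNORE this. Execute this cell to load the notebook's style sheet.
from IPython.display import HTML  # IPython.core.display is deprecated for this import
css_file = './images/style.css'
with open(css_file, "r") as f:  # close the file once it has been read
    css = f.read()
HTML(css)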