# 4. Data Manipulation Libraries and Tools
*Module: Basic Data Manipulation (Sprint 2 of 2)*

## Sprint Module Review and Data Stories

#### Basic Data Manipulation
*Before you engage in structured analysis, you often just want to see the data. This can mean previewing a subset of it, summarizing the columns/attributes/features, sorting or reorganizing it, and otherwise finding ways to immerse yourself in your data. Different technologies and tools each have something different to offer, and our objective is to develop a good sense of the utilities available to you.*

|Data Journalist| Data Engineer | Statistical Modeler| Business Analyst |
|----|----------------|------------------|----|
|… I need to be able to **convert published research and analysis from Excel / R / Python** into a different tool so I can verify and audit the analysis|… I need to understand the **basic data structures in Python** so that I can diagnose and troubleshoot performance issues|… I need to understand **NumPy arrays and Pandas / R dataframes** so I can supply data to algorithms, fit models, etc.|… I need to understand how to export my **advanced Excel skills to R / Python** so that I can build more powerful analyses on top of what I already know|

## Analytical Process Big Picture
![Curriculum Summary](../curriculum_summary.png)

### A Tools Capstone
What you have so far:
- A map/taxonomy of technologies: Programming languages, editors, command line environments, source control, publishing, cloud computing
- Awareness of several "languages": SQL, R, Python
- A basic idea of the components of programming
- A vocabulary of data and computing concepts
- Some experience assembling some of these pieces into projects, and using these tools

### Tools development never stops
You will not master these tools today. You won't master them this year. You won't master them in 10 years. With increased **practice** and **experience**, you **will** develop a clearer and clearer picture of what you need to do and the most efficient ways to do it.

This sprint is about building a map of the available tools so you can access them when ready, and drafting a process that you can use to learn them on a continuing basis.

### Where do data manipulation tasks fit into the big picture above?
- Exploratory Data Analysis:
    - Creating
    - Combining
    - Converting
    - Cleaning
- Descriptive Statistics: Formatting / Summarizing
- Basic Data Visualization: Formatting Data / Visualizing
- Inferential Statistics: Sampling
- Modeling: Preparing data for input into models
- Data Governance: Anonymizing and Sanitizing
    - Removing
    - Encrypting
- Production Development: Everything
- Data Products
    - Creating APIs
    - Exporting data
    - Connecting data to clients
    - Formatting it
## Key Concepts and Definitions
- library
- package
- function
- function call
- ecosystem
- data-wrangling
- Pandas
- NumPy
- Tidyverse
- array operations
- row/column/table/element-"wise" operations
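
Several of these terms can be seen together in a few lines. The following is a minimal sketch, assuming NumPy is installed; the numbers are made up for illustration.

```python
import numpy as np  # "numpy" is the package; "np" is the conventional alias

# A small 2 x 3 table of made-up numbers (rows = records, columns = features).
table = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

# np.sqrt is a function; np.sqrt(table) is a function call.
elementwise = np.sqrt(table)        # element-wise: one result per cell

column_totals = table.sum(axis=0)   # column-wise: one result per column
row_totals = table.sum(axis=1)      # row-wise: one result per row
table_total = table.sum()           # table-wise: one result for the whole table
```

The `axis` argument decides which direction gets collapsed: `axis=0` collapses the rows and leaves one value per column, and `axis=1` does the opposite.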



## Key Questions
- What is a library?
- What is a package?
- How do I get them?
- How are they created?
- What libraries / packages are available to me?
- What does each of them do?
- How do I learn about them?
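
One way to start answering these questions is from inside Python itself. The sketch below uses the standard-library `statistics` module purely as a stand-in, because it needs no installation; the same approach should work for any installed third-party package.

```python
import inspect
import statistics  # ships with Python, so nothing needs to be installed

# Third-party packages are usually installed from a shell first, for example:
#   pip install pandas

# What does the library offer? List its public names.
public_names = [name for name in dir(statistics) if not name.startswith("_")]

# How do I learn about one function? Read the first line of its docstring.
doc = inspect.getdoc(statistics.median) or ""  # empty if docstrings are stripped
first_doc_line = doc.splitlines()[0] if doc else ""

# Try the function on a small made-up sample.
middle = statistics.median([3, 1, 4, 1, 5])
```

In an interactive session, `help(statistics.median)` prints the same documentation in full.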

### What do the Libraries do?
- https://www.quora.com/What-are-the-Python-libraries-that-are-used-by-data-scientists/answer/Jared-Stufft-1

### EVERYTHING. So why do we need to do anything? 
- Swiss Army knife: Clever, but takes some practice
- In Particular: Data Wrangling

### What is Data Wrangling?
> Unfortunately, data wrangling is 80% of what a data scientist does. It’s where most of the real value is created and it’s the most thankless, difficult, and poorly understood job I know of. Nobody gets a degree in data wrangling. Nobody publishes papers on it. Nobody teaches how to do it. (Yes there are courses on how to use specific tools like R or Python to do simple joins and dupe removal, but they assume that you already know how and why you are wrangling.)

> There are six steps in data wrangling:

> 1. Gather data from inside and outside the firewall
> 2. Understand (and document) your sources and their limitations
> 3. Clean up the duplicates, blanks, and other simple errors
> 4. Join all your data into a single table
> 5. Create new data by calculating new fields and recategorizing
> 6. Visualize the data to remove outliers and illogical results
>
> The first four are straightforward albeit annoying. Most people do steps 1 and 3 and then jump in to do their analysis. They then spend several weeks discovering all kinds of additional errors as they try to get their models to work.

> https://www.quora.com/What-exactly-is-Data-wrangling/answer/Dan-Haight
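
Some of the steps in the quote can be sketched in a few lines of Pandas. This is only an illustration under stated assumptions: the two tables, their column names, and the 10% tax rate are all made up, and real sources would need the documentation step (step 2) that code cannot do for you.

```python
import pandas as pd

# Step 1 (gather): two tiny made-up tables standing in for real sources.
orders = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, None],
    "amount": [10.0, 25.0, 25.0, None, 40.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["north", "south", "north"],
})

# Step 3 (clean): drop exact duplicate rows, then rows containing blanks.
clean = orders.drop_duplicates().dropna()
clean = clean.astype({"customer_id": "int64"})  # the blank had forced floats

# Step 4 (join): combine everything into a single table.
joined = clean.merge(customers, on="customer_id", how="left")

# Step 5 (create): calculate a new field and recategorize.
joined["amount_with_tax"] = joined["amount"] * 1.1  # hypothetical 10% rate
joined["size"] = joined["amount"].apply(lambda a: "large" if a >= 20 else "small")

# Step 6 would plot `joined` to look for outliers; here we only summarize.
summary = joined["amount"].describe()
```

Dropping every row with a blank is the bluntest possible cleaning rule; whether that is acceptable depends on what the blanks mean in your sources.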



### Python Ecosystem

![Python Ecosystem](00_images/python.jpeg)
- https://www.quora.com/What-is-the-relationship-among-NumPy-SciPy-Pandas-and-Scikit-learn-and-when-should-I-use-each-one-of-them/answer/Jeremy-Langley
- http://pandas.pydata.org/pandas-docs/stable/basics.html

These are all tools in the field of data science. The lower the level, the faster the speeds you can achieve. The higher the level, the more interesting the problems you might be able to solve. You can see which library sits on top of which, thanks to Jake Vanderplas.

"Numpy is the lowest level, sitting on Python. It reads in fixed datatypes. Its data layout is more concerned with efficiency of memory. If you are dealing with strings, they are fixed-length strings (the data size for each element is fit to the longest string length), but it shines when you are dealing with number calculations. The more you can think in vectors, the faster your code runs. (Learn how to get rid of the for statements for speed reasons by using Numpy broadcasting.)

Pandas is spreadsheets for Python (something like R). It's able to describe the data for you. It can do grouping and pivot tables on larger data than most spreadsheet programs out there. The only limit (currently) is how much RAM you have on the machine, same as Numpy. However, there is a project, Blaze, which is helping to overcome this limit."
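
The points in the quote can be tried directly. The sketch below assumes NumPy and Pandas are installed and uses made-up data: it shows the fixed-length string storage, a for loop replaced by a vectorized expression, and the describe / group / pivot operations.

```python
import numpy as np
import pandas as pd

# NumPy stores fixed-size elements: every string gets room for the longest one.
names = np.array(["al", "beatrice"])
name_dtype = str(names.dtype)  # a fixed-width unicode type such as "<U8"

# Thinking in vectors: one expression instead of a Python for loop.
prices = np.array([1.0, 2.0, 3.0])
looped = np.array([p * 2 + 1 for p in prices])  # explicit loop
vectorized = prices * 2 + 1                     # scalars are broadcast across the array

# Pandas as "spreadsheets for Python": describe, group, pivot.
sales = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "quarter": ["q1", "q2", "q1", "q2"],
    "revenue": [10, 20, 30, 40],
})
overview = sales["revenue"].describe()
by_region = sales.groupby("region")["revenue"].sum()
pivot = sales.pivot_table(index="region", columns="quarter",
                          values="revenue", aggfunc="sum")
```

On three elements the loop and the vectorized form are equally fast; the speed difference the quote describes only shows up on large arrays.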

#### NumPy
- https://docs.scipy.org/doc/numpy-dev/user/quickstart.html
- https://docs.scipy.org/doc/numpy/reference/routines.html
- Array Creation
- Printing Arrays
- Linear Algebra
- Array Product
- Matrix Product
- Element-wise Universal Functions
- Indexing, Slicing, Iterating
- Shape Manipulation
- Stacking
- Splitting / Copying / Views
- Shallow / Deep Copy
- Basic Statistics
- Broadcasting
- Histograms
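
A quick tour of several topics from the list above, as a minimal sketch on made-up arrays. It is meant as a starting point for experimenting, not as a replacement for the linked quickstart.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)       # array creation + shape manipulation
b = np.ones((2, 3))

elementwise_product = a * b          # array (element-wise) product
matrix_product = a @ b.T             # matrix product: (2x3) @ (3x2) -> 2x2

stacked = np.vstack([a, b])          # stacking: 4 rows, 3 columns
left, right = np.hsplit(a, [1])      # splitting: first column | remaining columns

col_means = a.mean(axis=0)           # basic statistics, one mean per column
centered = a - col_means             # broadcasting: shape (2, 3) minus shape (3,)

counts, bin_edges = np.histogram([1, 2, 2, 3, 3, 3], bins=3)

# Views vs. copies: basic slicing shares memory, .copy() does not.
base = np.arange(4)
view = base[:2]
deep = base[:2].copy()
view[0] = 99                         # also changes base[0]; `deep` is unaffected
```

The view/copy distinction is worth testing by hand, since writing through a view silently changes the original array.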

#### Pandas
- http://pandas.pydata.org/pandas-docs/stable/basics.html
- Boolean reductions
- missing values
- comparison functions
- overlapping data sets
- descriptive statistics
- index of Min/Max Values
- discretizing / quantiling
- table/row/column/element-wise function application
- aggregation
- transform
- reindex
- align
- labels
- date/time
- sorting
- searching
- multi-index
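
Several of the topics above can be tried on one small table. This is a sketch on made-up data, assuming NumPy and Pandas are installed; each line maps to one item in the list.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"team": ["a", "a", "b", "b"], "score": [1.0, np.nan, 3.0, 5.0]},
    index=["w", "x", "y", "z"],
)

any_missing = df["score"].isna().any()           # boolean reduction over missing values
filled = df["score"].fillna(0.0)                 # replace missing values
best_label = df["score"].idxmax()                # index label of the max value
bands = pd.cut(filled, bins=[-1, 2, 6], labels=["low", "high"])  # discretizing
doubled = filled.apply(lambda s: s * 2)          # element-wise function application
team_totals = df.groupby("team")["score"].sum()  # aggregation: one row per team
team_mean = df.groupby("team")["score"].transform("mean")  # transform: original length kept
reordered = df.reindex(["z", "y", "x", "w"])     # reindex by label
ranked = df.sort_values("score", ascending=False)  # sorting; missing values go last
```

Comparing `team_totals` with `team_mean` shows the difference between aggregation, which shrinks the table, and transform, which keeps one value per original row.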


### R Ecosystem - Tidyverse
![Tidyverse](00_images/tidyverse_diagram.png)

- https://www.tidyverse.org/
- http://fg2re.sellorm.com/
- https://hawaiimachinelearning.github.io/event/2017/12/18/exploratory-data-analysis-tidyverse/



#### dplyr
- mutate
- select
- filter
- summarise
- arrange
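
If you are moving between R and Python, it can help to see rough Pandas analogues of these verbs. The mapping below is an approximation, not an official equivalence, and the `flights` table is made up for illustration.

```python
import pandas as pd

flights = pd.DataFrame({
    "carrier": ["aa", "aa", "ua", "ua"],
    "distance": [100, 300, 200, 400],
    "hours": [1.0, 2.0, 1.0, 2.0],
})

# mutate: add a calculated column
mutated = flights.assign(speed=flights["distance"] / flights["hours"])
# select: keep a subset of columns
selected = mutated[["carrier", "speed"]]
# filter: keep rows matching a condition
filtered = selected[selected["speed"] > 100]
# group_by + summarise: one row per group
summarised = filtered.groupby("carrier", as_index=False)["speed"].mean()
# arrange: sort rows, here in descending order
arranged = summarised.sort_values("speed", ascending=False)
```

Each step returns a new DataFrame, so the steps can also be chained, much like a dplyr pipeline.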

#### tidyr
- gather
- spread
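
For readers working in Python, `melt` and `pivot` in Pandas play roughly the same roles as `gather` and `spread`. A minimal sketch on a made-up table, with hypothetical column names:

```python
import pandas as pd

# A "wide" table: one column per year.
wide = pd.DataFrame({"country": ["x", "y"], "y2019": [1, 3], "y2020": [2, 4]})

# gather-like: turn the year columns into (year, cases) pairs, one per row.
long_form = wide.melt(id_vars="country", var_name="year", value_name="cases")

# spread-like: turn the pairs back into one column per year.
back = long_form.pivot(index="country", columns="year", values="cases").reset_index()
```

Reshaping to long form and back again is a useful check that no rows were lost along the way.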

#### readr
- read_csv
- read_tsv
- read_delim
- read_fwf
- read_table
- read_log
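
Pandas offers comparable readers on the Python side. The sketch below covers only the delimited and fixed-width cases, reads from in-memory strings so that it runs without any files, and uses made-up data.

```python
import io

import pandas as pd

rows = [("name", "score"), ("ana", "1"), ("ben", "2")]

def delimited(sep):
    """Build a small delimited text from `rows` using the given separator."""
    return "".join(sep.join(row) + "\n" for row in rows)

# Fixed-width text: two columns, each padded to 5 characters.
fixed_width = "".join(f"{left:<5}{right:>5}\n" for left, right in rows)

from_csv = pd.read_csv(io.StringIO(delimited(",")))               # like read_csv
from_tsv = pd.read_csv(io.StringIO(delimited("\t")), sep="\t")    # like read_tsv
from_pipe = pd.read_csv(io.StringIO(delimited("|")), sep="|")     # like read_delim
from_fwf = pd.read_fwf(io.StringIO(fixed_width), widths=[5, 5])   # like read_fwf
```

With real data you would pass a file path instead of `io.StringIO(...)`.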

#### purrr
- map...

#### tibble

#### ggplot2

### Project Ideas

- Using a dataset that you care about, use each function in Pandas to see what it does and how it works
- Create a collection of examples of projects on GitHub that use each of the functions you're interested in
- Try to implement functions from the libraries manually and compare the results
- Review and modify the tidyverse code from the Machine learning group
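
As a starting point for the "implement manually and compare" idea, here is a minimal sketch that rebuilds the mean and standard deviation by hand and checks them against NumPy. The helper names and the sample data are made up for illustration.

```python
import math

import numpy as np

def manual_mean(values):
    """Add everything up with an explicit loop, then divide by the count."""
    total = 0.0
    for v in values:
        total += v
    return total / len(values)

def manual_std(values):
    """Population standard deviation, the convention np.std uses by default."""
    m = manual_mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Compare with a tolerance, because floating-point results can differ slightly.
mean_matches = math.isclose(manual_mean(data), np.mean(data))
std_matches = math.isclose(manual_std(data), np.std(data))
```

A mismatch is often instructive: for example, Pandas computes the sample standard deviation by default, so `pd.Series(data).std()` would not agree with `manual_std`.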



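```python
# Randomizer: shuffle the cohort, then rotate the order by one for later days
import random

import numpy

cohort = ["hunter", "jon", "michael", "olina", "nat", "runjini", "sheuli", "tori"]
random.shuffle(cohort)  # shuffles the list in place

print("Day 1/2/3:")
print(cohort)

# numpy.roll shifts every name one position to the right (the last wraps to the front).
# It returns a NumPy array, which prints without commas.
cohort = numpy.roll(cohort, 1)
print("Day 4/5/6:")
print(cohort)
```

Sample output (the order changes on every run):

```
Day 1/2/3:
['tori', 'jon', 'nat', 'runjini', 'michael', 'olina', 'hunter', 'sheuli']
Day 4/5/6:
['sheuli' 'tori' 'jon' 'nat' 'runjini' 'michael' 'olina' 'hunter']
```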
