# Data Manipulation Workshop
## Ann Arbor Data Dive

Instructor: [Jeff Lockhart](http://www-personal.umich.edu/~jwlock/)

Date: 11/11/2017, 8:30 - 9:30 AM

Materials online at: **[github.com/jwlockhart/data_workshops](https://github.com/jwlockhart/data_workshops/tree/master/intro_data_manip)**

## Import packages
- Packages bundle useful tools and functions for doing things in Python.
- `pandas` is a package of tools for working with data.
- The code below tells Python to use the abbreviation `pd` for `pandas`. Programmers often do this to save typing.
- `matplotlib` is a package for making charts and graphs. Here we use its `pyplot` module, abbreviated as `plt`.
- `%matplotlib inline` is what Jupyter Notebooks call a "magic" command. It tells the notebook to display graphs inside the notebook rather than saving them as files or opening them in separate windows.

In [5]:
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

## Load data
- This code reads in data so that we can work with it in Python.
- Later, we'll use different code to save our work to a file so that we can come back to it.
- `pandas` can read and write data in many formats using paired functions such as these:
    - `read_csv` / `to_csv`
    - `read_json` / `to_json`
    - `read_html` / `to_html`
    - `read_clipboard` / `to_clipboard`
    - `read_excel` / `to_excel`
    - `read_hdf` / `to_hdf`
    - `read_feather` / `to_feather`
    - `read_msgpack` / `to_msgpack` (removed in pandas 1.0)
    - `read_stata` / `to_stata`
    - `read_sas` (read only)
    - `read_pickle` / `to_pickle`
    - `read_sql` / `to_sql`
    - `read_gbq` / `to_gbq` (Google BigQuery)
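
As a small illustration of the read/write pairing in the list above, the sketch below writes a tiny table to CSV text and reads it back. The `toy` DataFrame and its values are made up for this example and are not part of the workshop data; an in-memory buffer stands in for a file on disk.

```python
import io

import pandas as pd

# A tiny, made-up table (hypothetical values, not the workshop data).
toy = pd.DataFrame({"age": [25, 40, 61], "sex": ["Female", "Male", "Female"]})

# Write CSV text to an in-memory buffer instead of a real file.
# index=False leaves out the row labels so they are not stored as a column.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV text back into a new DataFrame.
buffer.seek(0)
round_trip = pd.read_csv(buffer)

print(round_trip)
```

If a table is written with the default `index=True` and read back without further options, the saved row labels usually come back as an extra column named `Unnamed: 0`.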

In [9]:
gss = pd.read_csv('gss.csv')

## Learn a bit about our data

In [10]:
print("The GSS data has", gss.shape[0], "rows and", gss.shape[1], "columns.")

The GSS data has 32561 rows and 15 columns.


In [11]:
gss.head()

| | age | workclass | fnlwgt | education | education.num | marital.status | occupation | relationship | race | sex | capital.gain | capital.loss | hours.per.week | native.country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 90 | ? | 77053 | HS-grad | 9 | Widowed | ? | Not-in-family | White | Female | 0 | 4356 | 40 | United-States | <=50K |
| 1 | 82 | Private | 132870 | HS-grad | 9 | Widowed | Exec-managerial | Not-in-family | White | Female | 0 | 4356 | 18 | United-States | <=50K |
| 2 | 66 | ? | 186061 | Some-college | 10 | Widowed | ? | Unmarried | Black | Female | 0 | 4356 | 40 | United-States | <=50K |
| 3 | 54 | Private | 140359 | 7th-8th | 4 | Divorced | Machine-op-inspct | Unmarried | White | Female | 0 | 3900 | 40 | United-States | <=50K |
| 4 | 41 | Private | 264663 | Some-college | 10 | Separated | Prof-specialty | Own-child | White | Female | 0 | 3900 | 40 | United-States | <=50K |


In [12]:
gss['age'].describe()

count    32561.000000
mean        38.581647
std         13.640433
min         17.000000
25%         28.000000
50%         37.000000
75%         48.000000
max         90.000000
Name: age, dtype: float64
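
To show where the rows of a `describe()` summary come from, here is a minimal sketch on a made-up five-value Series. The `ages` values are hypothetical and chosen so the arithmetic is easy to check by hand; they are not from the GSS file.

```python
import pandas as pd

# Five made-up ages (hypothetical values for illustration only).
ages = pd.Series([20, 30, 40, 50, 60], name="age")

# describe() returns a Series whose labels are the statistic names.
summary = ages.describe()
print(summary)

# Each row can also be computed on its own.
n = ages.count()                 # number of non-missing values
average = ages.mean()            # arithmetic mean
spread = ages.std()              # sample standard deviation (divides by n - 1)
middle = ages.median()           # same value as the "50%" row
lower_quartile = ages.quantile(0.25)  # same value as the "25%" row
```

Individual entries of the summary can be looked up by label, for example `summary["mean"]` or `summary["50%"]`.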

In [13]:
gss['education'].value_counts()

HS-grad         10501
Some-college     7291
Bachelors        5355
Masters          1723
Assoc-voc        1382
11th             1175
Assoc-acdm       1067
10th              933
7th-8th           646
Prof-school       576
9th               514
12th              433
Doctorate         413
5th-6th           333
1st-4th           168
Preschool          51
Name: education, dtype: int64
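
Raw counts like these are often easier to compare as shares of the total. The sketch below uses a small made-up Series (the `education` values here are hypothetical, not the GSS column) to show `value_counts()` next to its `normalize=True` variant, which returns proportions instead of counts.

```python
import pandas as pd

# A made-up column of six education labels (hypothetical values).
education = pd.Series(
    ["HS-grad", "HS-grad", "Bachelors", "Masters", "HS-grad", "Bachelors"],
    name="education",
)

# Counts per category, sorted from most to least common.
counts = education.value_counts()
print(counts)

# Proportions per category; these sum to 1.
shares = education.value_counts(normalize=True)
print(shares)
```

On the workshop data, the equivalent call would be `gss['education'].value_counts(normalize=True)`.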