<img src="https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png" style="float: left; margin: 15px;">

# Intro to Data Cleaning

***

Week 2 | Lesson 2.3

### LEARNING OBJECTIVES
*After this lesson, you will be able to:*

- Inspect data types and values
- Diagram a data processing workflow
- Clean up a column using df.apply()

### LESSON GUIDE
| TIMING  | TYPE  | TOPIC  |
|:-:|---|---|
| 5 min  | [Introduction](#introduction)   | Inspect data types, df.apply(), .value_counts()  |
| 10 min  | [Demo /Guided Practice](#common_steps)  | Common cleaning |
| 10 min  | [Demo /Guided Practice](#dfd)  | Data flow diagrams |
| 10 min  | [Demo /Guided Practice](#inspect_data_types)  | Inspecting data types |
| 10 min  | [Demo /Guided Practice](#apply)  | Applying functions |
| 10 min  | [Demo /Guided Practice](#value_counts)  | .value_counts() |
| 20 min  | [Independent Practice](#ind-practice)  |   |
| 5 min  | [Conclusion](#conclusion)  |   |

<a name="introduction"></a>
## Introduction: data cleaning (5 mins)

Since we're starting to get pretty comfortable with using pandas to do EDA, let's add some more tools to our toolbox.




### Conceptually: what do you look for, and how do you stay organized?

There's no magic formula, but we'll list common cleaning operations.

Reproducibility matters, so document! For high-level planning and documentation, **data flow diagrams** are helpful.



### Technically: once you know what you want to do, how do you do it in pandas?

Pandas has many functions to help process and manipulate your data. You learned some this morning. We'll take a second look at .dtypes, .apply() and .value_counts().

**.dtypes** is the DataFrame attribute that lists the data type of each column (a single Series or numpy array has one .dtype).

**df.apply()** applies a function along any axis of a DataFrame.

**pandas.Series.value_counts()** returns a Series containing counts of unique values. It excludes NaN values by default.

<a name="common_steps"></a>

## Common steps in cleaning data

- Drop outliers

For numerical data, consider dropping values > 3 SDs from the mean.

For categorical data, consider dropping cases accounting for < 1% of items.

- Normalize

Options include:

Max-min: $X_{norm} = (X - X_{min}) / (X_{max} - X_{min})$

Z-score: $Z_i = (X_i - \mathrm{mean}(X)) / \mathrm{sd}(X)$
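
As a rough sketch of how these two steps could look in pandas, the snippet below drops values more than 3 standard deviations from the mean and then applies max-min normalization. The data and the names `values`, `z_scores`, `no_outliers` and `min_max` are made up for illustration; they are not part of the lesson datasets.

```python
import pandas as pd

# Hypothetical data: 20 typical values plus one extreme value
values = pd.Series([10.0, 11.0, 9.0, 10.5, 9.5] * 4 + [100.0])

# Z-score: distance from the mean in units of standard deviation
z_scores = (values - values.mean()) / values.std()

# Drop outliers: keep only values within 3 standard deviations of the mean
no_outliers = values[z_scores.abs() <= 3]

# Max-min normalization: rescale the remaining values to the range [0, 1]
min_max = (no_outliers - no_outliers.min()) / (no_outliers.max() - no_outliers.min())
```

Dropping outliers before normalizing matters here: with the extreme value still present, max-min scaling would squeeze all the typical values into a narrow band near 0.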


- Relabel

*("Bachelor's", "BSc", "Bachelor of Arts") -> "BA"*

- Decode

*1 -> "EU", 2 -> "Asia-Pacific", 3 -> "MENA"*
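
One possible way to do both in pandas is with `Series.replace` and `Series.map`, each given a dictionary. The Series below are hypothetical and simply reuse the labels and codes from the examples above.

```python
import pandas as pd

# Hypothetical survey columns built from the labels and codes in the examples above
degrees = pd.Series(["Bachelor's", "BSc", "Bachelor of Arts", "BA"])
regions = pd.Series([1, 2, 3, 1])

# Relabel: collapse several spellings into one label; unlisted values stay unchanged
relabeled = degrees.replace({"Bachelor's": "BA", "BSc": "BA", "Bachelor of Arts": "BA"})

# Decode: translate numeric codes into names; codes missing from the dict become NaN
decoded = regions.map({1: "EU", 2: "Asia-Pacific", 3: "MENA"})
```

The difference in how unlisted values are treated is a reason to pick one over the other: `replace` leaves them alone, while `map` turns them into NaN, which makes unexpected codes easy to find.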




- Recast

*("1.0", "2.0", "3.0") -> (1,2,3)*

- Handle null values

Pandas will usually represent missing values as NaN when it loads your data, but it will not fill them in for you. Drop them? Replace with estimates?
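
A minimal sketch of recasting and of the two null-handling options, using small made-up Series (`as_text` and `scores` are illustrative names, and filling with the mean is just one possible estimate):

```python
import numpy as np
import pandas as pd

# Recast: numeric strings -> floats -> integers
as_text = pd.Series(["1.0", "2.0", "3.0"])
as_int = as_text.astype(float).astype(int)

# Hypothetical column with one missing value
scores = pd.Series([4.0, np.nan, 8.0])

n_missing = scores.isnull().sum()      # count the NaNs first
dropped = scores.dropna()              # option 1: drop them
filled = scores.fillna(scores.mean())  # option 2: replace with an estimate (here, the mean)
```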


- Binarize (dummy variables)

*(Blue, Green, Blue, Red, Red, Green) ->*
*IsBlue: (1, 0, 1, 0, 0, 0); IsGreen: (0, 1, 0, 0, 0, 1); IsRed: (0, 0, 0, 1, 1, 0)*

- Discretization

*(20, 56, 7, 2, 14, 89, 70, 40) -> (Adult, Adult, Child, Child, Child, Senior, Senior, Adult)*
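
Both steps have pandas helpers. The sketch below reproduces the two examples above with `pd.get_dummies` and `pd.cut`. The age cutoffs (Child up to 17, Adult 18 to 64, Senior 65 and over) are an assumption chosen to match the example output; the lesson does not state them.

```python
import pandas as pd

colors = pd.Series(["Blue", "Green", "Blue", "Red", "Red", "Green"])

# Binarize: one indicator column per category (Is_Blue, Is_Green, Is_Red)
dummies = pd.get_dummies(colors, prefix="Is").astype(int)

ages = pd.Series([20, 56, 7, 2, 14, 89, 70, 40])

# Discretize: bins are (0, 17], (17, 64], (64, 120]; the cutoffs are assumed
groups = pd.cut(ages, bins=[0, 17, 64, 120], labels=["Child", "Adult", "Senior"])
```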

> Look at the Billboard dataset (below). What kinds of cleaning might it require?

In [36]:
import pandas as pd
bb = pd.read_csv('assets/datasets/billboard.csv')
bb.head(15)

Unnamed: 0,year,artist.inverted,track,time,genre,date.entered,date.peaked,x1st.week,x2nd.week,x3rd.week,...,x67th.week,x68th.week,x69th.week,x70th.week,x71st.week,x72nd.week,x73rd.week,x74th.week,x75th.week,x76th.week
0,2000,Destiny's Child,Independent Women Part I,"3,38,00 AM",Rock,"September 23, 2000","November 18, 2000",78,63,49,...,*,*,*,*,*,*,*,*,*,*
1,2000,Santana,"Maria, Maria","4,18,00 AM",Rock,"February 12, 2000","April 8, 2000",15,8,6,...,*,*,*,*,*,*,*,*,*,*
2,2000,Savage Garden,I Knew I Loved You,"4,07,00 AM",Rock,"October 23, 1999","January 29, 2000",71,48,43,...,*,*,*,*,*,*,*,*,*,*
3,2000,Madonna,Music,"3,45,00 AM",Rock,"August 12, 2000","September 16, 2000",41,23,18,...,*,*,*,*,*,*,*,*,*,*
4,2000,"Aguilera, Christina",Come On Over Baby (All I Want Is You),"3,38,00 AM",Rock,"August 5, 2000","October 14, 2000",57,47,45,...,*,*,*,*,*,*,*,*,*,*
5,2000,Janet,Doesn't Really Matter,"4,17,00 AM",Rock,"June 17, 2000","August 26, 2000",59,52,43,...,*,*,*,*,*,*,*,*,*,*
6,2000,Destiny's Child,Say My Name,"4,31,00 AM",Rock'n'roll,"December 25, 1999","March 18, 2000",83,83,44,...,*,*,*,*,*,*,*,*,*,*
7,2000,"Iglesias, Enrique",Be With You,"3,36,00 AM",Latin,"April 1, 2000","June 24, 2000",63,45,34,...,*,*,*,*,*,*,*,*,*,*
8,2000,Sisqo,Incomplete,"3,52,00 AM",Rock'n'roll,"June 24, 2000","August 12, 2000",77,66,61,...,*,*,*,*,*,*,*,*,*,*
9,2000,Lonestar,Amazed,"4,25,00 AM",Country,"June 5, 1999","March 4, 2000",81,54,44,...,*,*,*,*,*,*,*,*,*,*


In [245]:
bb.columns.values

array(['year', 'artist.inverted', 'track', 'time', 'genre', 'date.entered',
       'date.peaked', 'x1st.week', 'x2nd.week', 'x3rd.week', 'x4th.week',
       'x5th.week', 'x6th.week', 'x7th.week', 'x8th.week', 'x9th.week',
       'x10th.week', 'x11th.week', 'x12th.week', 'x13th.week',
       'x14th.week', 'x15th.week', 'x16th.week', 'x17th.week',
       'x18th.week', 'x19th.week', 'x20th.week', 'x21st.week',
       'x22nd.week', 'x23rd.week', 'x24th.week', 'x25th.week',
       'x26th.week', 'x27th.week', 'x28th.week', 'x29th.week',
       'x30th.week', 'x31st.week', 'x32nd.week', 'x33rd.week',
       'x34th.week', 'x35th.week', 'x36th.week', 'x37th.week',
       'x38th.week', 'x39th.week', 'x40th.week', 'x41st.week',
       'x42nd.week', 'x43rd.week', 'x44th.week', 'x45th.week',
       'x46th.week', 'x47th.week', 'x48th.week', 'x49th.week',
       'x50th.week', 'x51st.week', 'x52nd.week', 'x53rd.week',
       'x54th.week', 'x55th.week', 'x56th.week', 'x57th.week',
       'x58th.week', '

<a name="dfd"></a>


## Planning your system with data flow diagrams (10 mins)


<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/2/24/Data-flow-diagram-notation.svg/220px-Data-flow-diagram-notation.svg.png">



- Function / process (ellipse or circle): takes in data from one or more sources, transforms it, and outputs it to one or more destinations 

- Store / file (parallel lines): place where data persists; takes an input and ought to have an output

- Input-output (rectangle): external entity that produces or consumes data

- Flow (arrow): shows the flow of specific data


> What is an example of data processing you may want to do? Sketch a data flow diagram and walk your tablemates through it.

<a name="inspect_data_types"></a>
## Demo /Guided Practice: Inspect data types  (10 mins)

Let's create a small dictionary with different data types in it. 

### Import Pandas + Numpy

In [1]:
import pandas as pd
import numpy as np

### Create Test Data

In [2]:
test_data = dict( 
    A = np.random.rand(3),
    B = 1,
    C = 'foo',
    D = pd.Timestamp('20010102'),
    E = pd.Series([1.0]*3).astype('float32'),
    F = False,
    G = pd.Series([1]*3,dtype='int8')
)

In [3]:
test_data

{'A': array([ 0.93014345,  0.87683759,  0.11540669]),
 'B': 1,
 'C': 'foo',
 'D': Timestamp('2001-01-02 00:00:00'),
 'E': 0    1.0
 1    1.0
 2    1.0
 dtype: float32,
 'F': False,
 'G': 0    1
 1    1
 2    1
 dtype: int8}

### Create our DataFrame

In [4]:
dft = pd.DataFrame(test_data)
dft

Unnamed: 0,A,B,C,D,E,F,G
0,0.930143,1,foo,2001-01-02,1.0,False,1
1,0.876838,1,foo,2001-01-02,1.0,False,1
2,0.115407,1,foo,2001-01-02,1.0,False,1


In [5]:
dft.dtypes

A           float64
B             int64
C            object
D    datetime64[ns]
E           float32
F              bool
G              int8
dtype: object

**What dtype might we expect when a single column holds values of mixed types?**

e.g.:  [2, 3, 4, 5, 6, 7, 8.9]

If a single column of a pandas object holds values of more than one type, the dtype of the column will be chosen to accommodate all of them (object is the most general).

### Ints are cast to floats

In [232]:
pd.Series([1, 2, 3, 4, 5, 6.])

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
dtype: float64

### Mixing in a string makes the whole Series ``object`` dtype

In [233]:
pd.Series([1, 2, 3, 'foo'])

0      1
1      2
2      3
3    foo
dtype: object

In [234]:
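# Older pandas spelled this dft.get_dtype_counts(); that method was removed in pandas 1.0.
# dft.dtypes.value_counts() reports the same counts (row order may differ).
dft.dtypes.value_counts()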

bool              1
datetime64[ns]    1
float32           1
float64           1
int64             1
int8              1
object            1
dtype: int64

> If you turn these into a pd.Series, what will be the dtype?

    [1, 3, 9, .33, False, '03-20-1978', np.arange(22)]



In [235]:
pd.Series([1, 3, 9, .33, False, '03-20-1978', np.arange(22)])

0                                                    1
1                                                    3
2                                                    9
3                                                 0.33
4                                                False
5                                           03-20-1978
6    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
dtype: object

## Why do you think it might be important to know what the column dtypes are? 

In [8]:
print(dft.select_dtypes(include=['bool']))
print(dft.select_dtypes(include=['bool'])*2)
print(dft.select_dtypes(include=['object']))
print(dft.select_dtypes(include=['object'])*2)
dft.select_dtypes(include=['object'])*0.5

       F
0  False
1  False
2  False
   F
0  0
1  0
2  0
     C
0  foo
1  foo
2  foo
        C
0  foofoo
1  foofoo
2  foofoo


TypeError: Could not operate 0.5 with block values can't multiply sequence by non-int of type 'float'
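
The same pitfall shows up when numbers are stored as text. The sketch below uses a made-up `prices` column to show how `*` repeats strings instead of multiplying, and how `pd.to_numeric` can recover real numbers. Choosing `errors="coerce"` is an assumption about how you want unparseable values handled.

```python
import pandas as pd

# Hypothetical column that was read in as text (object dtype)
prices = pd.Series(["10", "20", "thirty"], dtype=object)

# On strings, * means repetition: "1010", "2020", "thirtythirty"
doubled_text = prices * 2

# errors="coerce" turns values that cannot be parsed into NaN
numeric = pd.to_numeric(prices, errors="coerce")

# On floats, * is arithmetic: 20.0, 40.0, NaN
doubled = numeric * 2
```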

<a name=" df.apply()"></a>
## Demo / Guided Practice:  df.apply(), df.applymap(), Series.map() (20 mins)

df.apply() applies some function to each column (or row) of your dataframe (*"column-wise"*).

df.applymap() applies some function *element-wise*. (In pandas 2.1 and later the same method is named df.map(), and applymap() is deprecated.)

Series.map() is a pd.Series method that applies a function element-wise on a series.

> Check: why would these be useful in data cleaning?

In [13]:
# Create test data

df = pd.DataFrame(np.random.randint(-5, 5, (5,4)), columns = ['a','b','c','d'])
df


Unnamed: 0,a,b,c,d
0,3,-4,2,-1
1,0,4,-1,-5
2,-3,-5,0,4
3,3,1,0,2
4,2,3,-3,0


In [238]:
df.apply(min)

a   -3
b   -5
c   -3
d   -5
dtype: int64

In [None]:
df

In [239]:
df.apply(min, axis = 1)

0   -4
1   -5
2   -5
3    0
4   -3
dtype: int64

In [240]:
df.applymap(np.sqrt)


Unnamed: 0,a,b,c,d
0,1.732051,,1.414214,
1,0.0,2.0,,
2,,,0.0,2.0
3,1.732051,1.0,0.0,1.414214
4,1.414214,1.732051,,0.0


In [15]:
df['a'].map(np.sqrt)

0    1.732051
1    0.000000
2         NaN
3    1.732051
4    1.414214
Name: a, dtype: float64

In [18]:
df['a'].apply(min)

TypeError: 'int' object is not iterable

In [14]:
df['a'].apply(np.sqrt)

0    1.732051
1    0.000000
2         NaN
3    1.732051
4    1.414214
Name: a, dtype: float64

> Check: what happens if we try to use the min function in .applymap()?

In [242]:
df.applymap(min)

TypeError: ("'numpy.int64' object is not iterable", u'occurred at index a')

### Further Reading

For more advanced `.apply` usage, check out these links:

["Why Not"'s Gist Examples](https://gist.github.com/why-not/4582705)

[Chris Albon's Map + Apply Examples](http://chrisalbon.com/python/pandas_apply_operations_to_dataframes.html)


### **Check:** How would you find the std of the columns and rows? 

In [10]:
df.apply(np.std, axis = 0)

a    2.653300
b    1.720465
c    2.481935
d    3.249615
dtype: float64

In [11]:
df.apply(np.std, axis = 1)

0    2.947457
1    1.118034
2    1.500000
3    2.680951
4    2.487469
dtype: float64
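
One detail worth checking when computing a standard deviation is the degrees of freedom. pandas' own `.std()` defaults to the sample standard deviation (`ddof=1`), while `np.std` on a plain array defaults to the population standard deviation (`ddof=0`). The sketch below uses a small fixed frame (`small` is a made-up name) so that the numbers are reproducible.

```python
import numpy as np
import pandas as pd

# Small fixed frame so the results do not change between runs
small = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8]})

sample_std = small.std()                      # pandas default: ddof=1
population_std = small.std(ddof=0)            # population standard deviation
numpy_std = np.std(small.to_numpy(), axis=0)  # NumPy default on arrays: ddof=0

row_std = small.std(axis=1)                   # one value per row instead of per column
```

Passing `ddof` explicitly is a simple way to make sure two tools are computing the same quantity.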

<a name=".value_counts()"></a>
## Demo /Guided Practice: .value_counts() (< 10 mins)

Why is this important? .value_counts() reports how many times each unique value occurs, so anything unexpected stands out. Running it on each Series gives a quick overview of the values in our data:

 - Strings inside of mostly numeric / continuous data
 - Non-numeric values
 - General counts of values that we might expect to see
 - Most common / least common values

Let's create some random data

In [46]:
data = np.random.randint(0, 7, size = 50)
data

array([2, 0, 2, 5, 1, 0, 1, 2, 4, 1, 0, 4, 0, 4, 2, 5, 1, 0, 3, 2, 5, 4, 0,
       3, 4, 2, 2, 3, 3, 3, 1, 1, 1, 0, 1, 3, 3, 6, 6, 4, 6, 5, 5, 1, 0, 3,
       6, 0, 2, 3])

In [47]:
s = pd.Series(data)
s.head()

0    2
1    0
2    2
3    5
4    1
dtype: int64

In [48]:
# List how many times each number occurs in our array
s.value_counts()

3    9
1    9
0    9
2    8
4    6
5    5
6    4
dtype: int64
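
To see how this helps with the kinds of problems listed above, here is a sketch on a deliberately messy, made-up column (`ratings` is a hypothetical name). It also shows two optional parameters, `dropna` and `normalize`.

```python
import numpy as np
import pandas as pd

# Hypothetical messy column: a stray string, a '*' sentinel and a missing value
ratings = pd.Series([5, 4, 5, "five", "*", 3, 5, np.nan])

counts = ratings.value_counts()                    # NaN is excluded by default
with_missing = ratings.value_counts(dropna=False)  # NaN gets its own row
shares = ratings.value_counts(normalize=True)      # proportions instead of counts
```

The rare entries `"five"` and `"*"` appear at the bottom of `counts`, which is usually where cleaning problems hide.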

### Lab preview: let's munge the Billboard dataset!

In [49]:
bb.head(10)

Unnamed: 0,year,artist.inverted,track,time,genre,date.entered,date.peaked,x1st.week,x2nd.week,x3rd.week,...,x67th.week,x68th.week,x69th.week,x70th.week,x71st.week,x72nd.week,x73rd.week,x74th.week,x75th.week,x76th.week
0,2000,Destiny's Child,Independent Women Part I,"3,38,00 AM",Rock,"September 23, 2000","November 18, 2000",78,63,49,...,*,*,*,*,*,*,*,*,*,*
1,2000,Santana,"Maria, Maria","4,18,00 AM",Rock,"February 12, 2000","April 8, 2000",15,8,6,...,*,*,*,*,*,*,*,*,*,*
2,2000,Savage Garden,I Knew I Loved You,"4,07,00 AM",Rock,"October 23, 1999","January 29, 2000",71,48,43,...,*,*,*,*,*,*,*,*,*,*
3,2000,Madonna,Music,"3,45,00 AM",Rock,"August 12, 2000","September 16, 2000",41,23,18,...,*,*,*,*,*,*,*,*,*,*
4,2000,"Aguilera, Christina",Come On Over Baby (All I Want Is You),"3,38,00 AM",Rock,"August 5, 2000","October 14, 2000",57,47,45,...,*,*,*,*,*,*,*,*,*,*
5,2000,Janet,Doesn't Really Matter,"4,17,00 AM",Rock,"June 17, 2000","August 26, 2000",59,52,43,...,*,*,*,*,*,*,*,*,*,*
6,2000,Destiny's Child,Say My Name,"4,31,00 AM",Rock'n'roll,"December 25, 1999","March 18, 2000",83,83,44,...,*,*,*,*,*,*,*,*,*,*
7,2000,"Iglesias, Enrique",Be With You,"3,36,00 AM",Latin,"April 1, 2000","June 24, 2000",63,45,34,...,*,*,*,*,*,*,*,*,*,*
8,2000,Sisqo,Incomplete,"3,52,00 AM",Rock'n'roll,"June 24, 2000","August 12, 2000",77,66,61,...,*,*,*,*,*,*,*,*,*,*
9,2000,Lonestar,Amazed,"4,25,00 AM",Country,"June 5, 1999","March 4, 2000",81,54,44,...,*,*,*,*,*,*,*,*,*,*


Where do we start? Let's start with the null value sentinels.

In [50]:
def replace_nulls(value):
    if value == '*':
        return np.nan
    else:
        return value



In [52]:
bb.applymap(replace_nulls)

Unnamed: 0,year,artist.inverted,track,time,genre,date.entered,date.peaked,x1st.week,x2nd.week,x3rd.week,...,x67th.week,x68th.week,x69th.week,x70th.week,x71st.week,x72nd.week,x73rd.week,x74th.week,x75th.week,x76th.week
0,2000,Destiny's Child,Independent Women Part I,"3,38,00 AM",Rock,"September 23, 2000","November 18, 2000",78,63,49,...,,,,,,,,,,
1,2000,Santana,"Maria, Maria","4,18,00 AM",Rock,"February 12, 2000","April 8, 2000",15,8,6,...,,,,,,,,,,
2,2000,Savage Garden,I Knew I Loved You,"4,07,00 AM",Rock,"October 23, 1999","January 29, 2000",71,48,43,...,,,,,,,,,,
3,2000,Madonna,Music,"3,45,00 AM",Rock,"August 12, 2000","September 16, 2000",41,23,18,...,,,,,,,,,,
4,2000,"Aguilera, Christina",Come On Over Baby (All I Want Is You),"3,38,00 AM",Rock,"August 5, 2000","October 14, 2000",57,47,45,...,,,,,,,,,,
5,2000,Janet,Doesn't Really Matter,"4,17,00 AM",Rock,"June 17, 2000","August 26, 2000",59,52,43,...,,,,,,,,,,
6,2000,Destiny's Child,Say My Name,"4,31,00 AM",Rock'n'roll,"December 25, 1999","March 18, 2000",83,83,44,...,,,,,,,,,,
7,2000,"Iglesias, Enrique",Be With You,"3,36,00 AM",Latin,"April 1, 2000","June 24, 2000",63,45,34,...,,,,,,,,,,
8,2000,Sisqo,Incomplete,"3,52,00 AM",Rock'n'roll,"June 24, 2000","August 12, 2000",77,66,61,...,,,,,,,,,,
9,2000,Lonestar,Amazed,"4,25,00 AM",Country,"June 5, 1999","March 4, 2000",81,54,44,...,,,,,,,,,,
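
Note that the call above returns a new DataFrame; `bb` itself only changes if you assign the result back. As an alternative sketch, the sentinel can also be handled without a custom function, either while reading the file or with `replace`. The snippet uses a tiny in-memory stand-in for the Billboard file (`csv_text` is a made-up name holding values copied from the first two rows above), since the real path may differ on your machine.

```python
import io

import numpy as np
import pandas as pd

# Tiny stand-in for the Billboard file, with '*' as the null sentinel
csv_text = '''track,x1st.week,x67th.week
Independent Women Part I,78,*
"Maria, Maria",15,*
'''

# Option 1: declare the sentinel while reading, so '*' is parsed as NaN
sample = pd.read_csv(io.StringIO(csv_text), na_values="*")

# Option 2: replace the sentinel after loading, and keep the result
raw = pd.read_csv(io.StringIO(csv_text))
replaced = raw.replace("*", np.nan)
```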


<a name="ind-practice"></a>
## Independent Practice (20 minutes)

Using our old friend, the [sales_info.csv](assets/datasets/sales_info.csv) dataset:

- Inspect the data types
- Let's say all the values in the first column are too low by 1: use .apply() on that column to add 1 to each value (df.applymap would change every column)
- Use .value_counts to count the values of each column in the dataset

**Bonus** 
- Write functions to bin the numerical values in each column into 'low', 'medium' and 'high' categories (hint: pandas has a built-in [quantile](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.quantile.html) function)
- Use .value_counts again on each column of the dataset, and check that the ratios of the new values are what you expect
- Look at the advanced reading, and rewrite your binning functions as lambda functions



In [28]:
sales_df = pd.read_csv('sales_info.csv')
sales_df.head()


Unnamed: 0,volume_sold,2015_margin,2015_q1_sales,2016_q1_sales
0,18.42076,93.802281,337166.53,337804.05
1,4.77651,21.082425,22351.86,21736.63
2,16.602401,93.612494,277764.46,306942.27
3,4.296111,16.824704,16805.11,9307.75
4,8.156023,35.011457,54411.42,58939.9


In [29]:
def add_one(x):
    x = x+1
    return x

In [30]:
a = 18
add_one(a)

19

In [31]:
sales_df.dtypes

volume_sold      float64
2015_margin      float64
2015_q1_sales    float64
2016_q1_sales    float64
dtype: object

In [33]:
sales_df['volume_sold'].apply(add_one)

0      19.420760
1       5.776510
2      17.602401
3       5.296111
4       9.156023
5       6.005122
6      15.606750
7       5.456466
8       6.047530
9       6.388070
10     10.347349
11     11.930398
12      7.270209
13     13.395919
14      5.557712
15      5.200122
16     11.252870
17     13.076785
18      4.725095
19      4.210727
20      7.290971
21      8.434821
22      5.376225
23     13.988913
24     12.697456
25      6.965175
26      4.945223
27      8.369585
28      8.343509
29     13.350027
         ...    
170     9.443932
171     6.151964
172     7.537069
173     9.500445
174     4.931543
175     7.163689
176     5.904447
177     8.402413
178    48.503269
179    56.739180
180    12.840780
181     8.002294
182     9.753142
183     4.147741
184     8.196779
185    77.203692
186    11.804337
187    11.705327
188    52.800686
189     6.882779
190     7.686406
191     6.833355
192    46.556096
193     6.172606
194    11.118018
195    52.675537
196     3.794631
197     8.6116

<a name="conclusion"></a>
## Conclusion (5 mins)
So far we've used pandas to look at the head and tail of a data set. We've also looked at summary stats and different data types, and we've selected and sliced data.

Today we added inspecting data types, df.apply() and .value_counts() to our pandas arsenal.

### Advanced reading (optional)

.apply() functions are a typical use case for [lambda](http://www.secnetix.de/olli/Python/lambda_functions.hawk) [functions](https://pythonconquerstheuniverse.wordpress.com/2011/08/29/lambda_tutorial/) -- we'll go over them later in the class, but dive in now if you'd like!
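
As a small preview, here is a sketch of the same add-one logic written three ways. The `volume` Series is a made-up stand-in for a numeric column.

```python
import pandas as pd

# Hypothetical values standing in for a numeric column
volume = pd.Series([18.4, 4.8, 16.6])


def add_one(x):
    """Return the value increased by one."""
    return x + 1


with_def = volume.apply(add_one)             # named function
with_lambda = volume.apply(lambda x: x + 1)  # same logic as an anonymous function
vectorized = volume + 1                      # for simple arithmetic, plain operators also work
```

A lambda is handy when the function is short and used only once; for anything longer, a named function is usually easier to read and test.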