# Module 1 - Manipulating data with Pandas

![](https://media.giphy.com/media/lPdnkrxkqnS48/giphy.gif)

### Introduction
You are interested in predicting health outcomes for people at risk for heart disease.  You have obtained a set of labeled data. Before modeling, you will spend time performing exploratory data analysis and begin with feature engineering. 

#### _Our goals today are to be able to_: <br/>

- Apply and use info, describe, mean, min, max, apply, and applymap from the Pandas library
- Explain what a groupby object is and split a DataFrame using a groupby
- Explain lambda functions and pass them to .apply() on a DataFrame
- Reshape a DataFrame using joins, merges, pivoting, stacking, and melting


### Activation 
Compare the attributes and methods of NumPy arrays, pandas Series, and pandas DataFrames.<br>
[array](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.ndarray.html)<br>
[series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html)<br>
[DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)

In [21]:
# For each object type, write down two attributes and two methods
# that you were unfamiliar with

### Our dataset comes from Kaggle, but has been downloaded for you. 

Take a second to check out the website from which it came:
https://www.kaggle.com/ronitf/heart-disease-uci.

### 1. Applying and using info, describe, mean, min, max, apply, and applymap from the Pandas library

The Pandas library has several useful tools built in. Let's explore some of them.

In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [15]:
original_df = pd.read_csv('heart.csv')
uci = original_df.copy()

In [18]:
uci.tail(10)

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
293,67,1,2,152,212,0,0,150,0,0.8,1,0,3,0
294,44,1,0,120,169,0,1,144,1,2.8,0,0,1,0
295,63,1,0,140,187,0,0,144,1,4.0,2,2,3,0
296,63,0,0,124,197,0,1,136,1,0.0,1,0,2,0
297,59,1,0,164,176,1,0,90,0,1.0,1,2,1,0
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0
302,57,0,1,130,236,0,0,174,0,0.0,1,1,2,0


Notice the name of the last column!

#### The .columns and .shape Attributes

In [15]:
uci.shape

(303, 14)
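
Both `.shape` and `.columns` are attributes, so they are written without parentheses. A minimal sketch, assuming a small made-up DataFrame named `toy` (not the heart-disease data):

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy = pd.DataFrame({"age": [63, 37, 41], "chol": [233, 250, 204]})

# .shape is a (number_of_rows, number_of_columns) tuple.
n_rows, n_cols = toy.shape

# .columns is an Index holding the column labels.
column_names = list(toy.columns)

print(n_rows, n_cols)
print(column_names)
```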

#### The .info() and .describe() Methods

Pandas DataFrames have many useful methods! Let's look at ```.info()``` and ```.describe()```.

In [7]:
# Call the .info() method on our dataset. What do you observe?


In [8]:
# Call the .describe() method on our dataset. What do you observe?


#### .mean(), .min(), .max(), .sum()

The methods .mean(), .min(), and .max() will perform just the way you think they will!

Note that these are methods both for Series and for DataFrames.

In [26]:
uci.max()

age          77.0
sex           1.0
cp            3.0
trestbps    200.0
chol        564.0
fbs           1.0
restecg       2.0
thalach     202.0
exang         1.0
oldpeak       6.2
slope         2.0
ca            4.0
thal          3.0
target        1.0
dtype: float64
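
To see the Series/DataFrame distinction in action, here is a small sketch on made-up data (the `toy` frame is an assumption, not part of the lesson's dataset). Called on a DataFrame, the method returns one value per column; called on a single column, it returns a scalar.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy = pd.DataFrame({"age": [63, 37, 41], "chol": [233, 250, 204]})

# DataFrame method: one mean per column, returned as a Series.
col_means = toy.mean()

# Series method: a single number.
age_mean = toy["age"].mean()

print(col_means)
print(age_mean)
```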

#### The Axis Variable

In [22]:
# axis=0 means 'column-wise': one result per column
# axis=1 means 'row-wise': one result per row
uci.sum() # Try [shift] + [tab] here!

age         16473.0
sex           207.0
cp            293.0
trestbps    39882.0
chol        74618.0
fbs            45.0
restecg       160.0
thalach     45343.0
exang          99.0
oldpeak       315.0
slope         424.0
ca            221.0
thal          701.0
target        165.0
dtype: float64
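
The effect of the axis argument is easiest to see on a tiny example. The sketch below uses a made-up two-column frame (an assumption for illustration) and sums it both ways.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 (the default): collapse the rows, leaving one total per column.
col_totals = toy.sum(axis=0)

# axis=1: collapse the columns, leaving one total per row.
row_totals = toy.sum(axis=1)

print(col_totals)
print(row_totals)
```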

#### .value_counts()

For a DataFrame _Series_, the .value_counts() method will tell you how many of each value you've got.

In [37]:
uci['age'].value_counts()[:10]

58    19
57    17
54    16
59    14
52    13
51    12
62    11
44    11
60    11
56    11
Name: age, dtype: int64
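
By default the counts come back sorted from most to least frequent. As a small sketch on a made-up Series (an assumption, not the lesson's data), the `normalize=True` option returns proportions instead of raw counts:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
letters = pd.Series(["a", "b", "a", "c", "a", "b"])

# Raw counts, most frequent value first.
counts = letters.value_counts()

# Same tally expressed as a fraction of all entries.
shares = letters.value_counts(normalize=True)

print(counts)
print(shares)
```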

$\bf{\rightarrow Exercise: What\ are\ the\ different\ values\ for\ restecg?}$

In [92]:
# Your code here!


#### DataFrame.applymap() and Series.map()

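The ```.applymap()``` method takes a function as input and applies it to every entry in the DataFrame. In recent pandas releases (2.1 and later, as far as we are aware) it is deprecated in favor of ```DataFrame.map()```, which does the same job.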

In [28]:
def successor(x):
    return x + 1

In [29]:
uci.applymap(successor).head()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,64,2,4,146,234,2,1,151,1,3.3,1,1,2,2
1,38,2,3,131,251,1,2,188,1,4.5,1,1,3,2
2,42,1,2,131,205,1,1,173,1,2.4,3,1,3,2
3,57,2,2,121,237,1,2,179,1,1.8,3,1,3,2
4,58,1,1,121,355,1,2,164,2,1.6,3,1,3,2


The .map() method takes a function as input that it will then apply to every entry in the Series.

In [55]:
uci.age.map(successor)

0      64
1      38
2      42
3      57
4      58
       ..
298    58
299    46
300    69
301    58
302    58
Name: age, Length: 303, dtype: int64
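
The two element-wise methods can be compared side by side. This is a sketch on made-up data; because the name of the DataFrame method depends on the installed pandas version, the code picks whichever one is available rather than assuming either.

```python
import pandas as pd

def successor(x):
    return x + 1

# Hypothetical toy data, used only for illustration.
toy = pd.DataFrame({"a": [1, 2], "b": [10, 20]})

# Use DataFrame.map where it exists, otherwise fall back to DataFrame.applymap.
elementwise = toy.map if hasattr(toy, "map") else toy.applymap

# Applied to every cell of the DataFrame.
bumped_frame = elementwise(successor)

# Applied to every entry of a single column (a Series).
bumped_column = toy["a"].map(successor)

print(bumped_frame)
print(bumped_column)
```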

In [None]:
# .apply() works much like .map() when called on a Series,
# but it can also be called on a DataFrame, where it passes
# each column (axis=0) or each row (axis=1) to the function.

In [63]:
def s_range(x):
    return x.max() - x.min()

uci.apply(s_range, axis = 0)


age          48.0
sex           1.0
cp            3.0
trestbps    106.0
chol        438.0
fbs           1.0
restecg       2.0
thalach     131.0
exang         1.0
oldpeak       6.2
slope         2.0
ca            4.0
thal          3.0
target        1.0
dtype: float64
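
The same function can be applied along either axis. A minimal sketch, assuming a made-up frame with columns `low` and `high`:

```python
import pandas as pd

def s_range(x):
    # x is a whole column (axis=0) or a whole row (axis=1), passed in as a Series.
    return x.max() - x.min()

# Hypothetical toy data, used only for illustration.
toy = pd.DataFrame({"low": [1, 5, 2], "high": [4, 9, 10]})

# One range per column.
range_by_column = toy.apply(s_range, axis=0)

# One range per row.
range_by_row = toy.apply(s_range, axis=1)

print(range_by_column)
print(range_by_row)
```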

## 2. Anonymous Functions (Lambda Abstraction)

Simple functions can be defined right in the function call. This is called 'lambda abstraction'; the function thus defined has no name and hence is "anonymous".

In [32]:
uci['oldpeak'].map(lambda x: round(x))[:4]

0    2
1    4
2    1
3    1
Name: oldpeak, dtype: int64
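
A lambda is interchangeable with a one-line named function. The sketch below (made-up values, hypothetical function name `double`) shows that both give the same result when passed to `.map()`:

```python
import pandas as pd

def double(x):
    return x * 2

# Hypothetical toy data, used only for illustration.
values = pd.Series([1.5, 2.0, 3.25])

# Named function passed by reference.
named = values.map(double)

# Equivalent anonymous function defined inside the call.
anonymous = values.map(lambda x: x * 2)

print(named.equals(anonymous))
```

A named function is usually the better choice once the logic is reused or grows beyond a single expression.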

$\bf{\rightarrow Exercise: Use\ an\ anonymous\ function\ to\ turn\ the\ entries\ in\ age\ to\ strings}$

In [69]:
# Your code here!


## 3. Methods for Re-Organizing DataFrames: filtering and .groupby()

In [134]:
uci[(uci['age'] == 60) | (uci['target']==1)]


0     63
1     37
2     41
3     56
4     57
      ..
56    48
57    45
58    34
59    57
60    71
Name: age, Length: 61, dtype: int64
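
The filter in the cell above combines two boolean masks with `|` (or). Each comparison needs its own parentheses because `|` and `&` bind more tightly than `==`. A small sketch on made-up data:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy = pd.DataFrame({"age": [60, 45, 60, 52], "target": [0, 1, 1, 0]})

# Rows where either condition holds.
either = toy[(toy["age"] == 60) | (toy["target"] == 1)]

# Rows where both conditions hold.
both = toy[(toy["age"] == 60) & (toy["target"] == 1)]

print(either)
print(both)
```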

Those of you familiar with SQL have probably used the GROUP BY command. Pandas has this, too.

The .groupby() method is especially useful for aggregate functions applied to the data grouped in particular ways.

In [34]:
uci.groupby('sex')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x1186c38d0>
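
As the output shows, `.groupby()` on its own computes nothing visible; it returns an object that remembers how the rows are split. One way to look inside is to iterate over it, which yields a (key, sub-DataFrame) pair per group. A sketch with made-up column names `team` and `score`:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy = pd.DataFrame({"team": ["x", "y", "y", "x"],
                    "score": [1.0, 2.0, 4.0, 3.0]})

grouped = toy.groupby("team")

# Each iteration gives the group's key and the rows belonging to it.
group_sizes = {key: len(frame) for key, frame in grouped}

print(group_sizes)
```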

### .groups and .get_group()

In [71]:
uci.groupby('sex').groups

{0: Int64Index([  2,   4,   6,  11,  14,  15,  16,  17,  19,  25,  28,  30,  35,
              36,  38,  39,  40,  43,  48,  49,  50,  53,  54,  59,  60,  65,
              67,  69,  74,  75,  82,  84,  85,  88,  89,  93,  94,  96, 102,
             105, 107, 108, 109, 110, 112, 115, 118, 119, 120, 122, 123, 124,
             125, 127, 128, 129, 130, 131, 134, 135, 136, 140, 142, 143, 144,
             146, 147, 151, 153, 154, 155, 161, 167, 181, 182, 190, 204, 207,
             213, 215, 216, 220, 223, 241, 246, 252, 258, 260, 263, 266, 278,
             289, 292, 296, 298, 302],
            dtype='int64'),
 1: Int64Index([  0,   1,   3,   5,   7,   8,   9,  10,  12,  13,
             ...
             288, 290, 291, 293, 294, 295, 297, 299, 300, 301],
            dtype='int64', length=207)}

In [36]:
uci.groupby('sex').get_group(0) # .tail()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
6,56,0,1,140,294,0,0,153,0,1.3,1,0,2,1
11,48,0,2,130,275,0,1,139,0,0.2,2,0,2,1
14,58,0,3,150,283,1,0,162,0,1.0,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
289,55,0,0,128,205,0,2,130,1,2.0,1,1,3,0
292,58,0,0,170,225,1,0,146,1,2.8,1,2,1,0
296,63,0,0,124,197,0,1,136,1,0.0,1,0,2,0
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0


### Aggregating

In [37]:
uci.groupby('sex').std()

sex,age,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,9.409396,0.972427,19.311119,65.088946,0.332455,0.55715,20.047969,0.422503,1.119844,0.593736,0.881026,0.44129,0.435286
1,8.883803,1.059064,16.658246,42.782392,0.366955,0.510754,24.130882,0.484505,1.174632,0.627378,1.074082,0.659949,0.498626
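
Aggregation can also be limited to one column and can compute several statistics at once via `.agg()`. A minimal sketch, again on made-up `team` and `score` columns rather than the lesson's data:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy = pd.DataFrame({"team": ["x", "y", "y", "x"],
                    "score": [1.0, 2.0, 4.0, 3.0]})

# Select one column from the grouped object, then request two aggregates.
summary = toy.groupby("team")["score"].agg(["mean", "max"])

print(summary)
```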


$\bf{\rightarrow Exercise: Tell\ me\ the\ average\ cholesterol\ level\ for\ those\ with\ heart\ disease.}$

In [39]:
# Your code here!



### 4. Reshaping a DataFrame

#### .pivot()

Those of you familiar with Excel have probably used Pivot Tables. Pandas has a similar functionality.

In [104]:
uci.pivot(values = 'age', columns = 'target')

target,0,1
0,,63.0
1,,37.0
2,,41.0
3,,56.0
4,,57.0
...,...,...
298,57.0,
299,45.0,
300,68.0,
301,57.0,
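
The NaN values in the output above appear because no `index` argument was given, so every original row keeps its own row label and fills only one of the two columns. Supplying an `index` produces a compact table. A sketch on made-up `day`, `city`, and `temp` columns:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy = pd.DataFrame({"day": ["mon", "mon", "tue", "tue"],
                    "city": ["a", "b", "a", "b"],
                    "temp": [10, 12, 11, 15]})

# One row per day, one column per city.
wide = toy.pivot(index="day", columns="city", values="temp")

# Without index, the original row labels are kept, leaving gaps.
sparse = toy.pivot(columns="city", values="temp")

print(wide)
print(sparse)
```

Note that `.pivot()` raises an error if an index/column pair occurs more than once; `.pivot_table()` handles that case by aggregating.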


Using whatever method you please (pivot, groupby, subsetting, etc.), return the average cholesterol of the women in the DataFrame.

## Methods for Combining DataFrames: .join(), .merge(), .concat(), .melt()

### .join()

In [76]:
toy1 = pd.DataFrame([[63, 142], [33, 47]], columns = ['age', 'HP'])
toy2 = pd.DataFrame([[63, 100], [33, 200]], columns = ['age', 'HP'])

In [77]:
toy1.join(toy2.set_index('age'), on = 'age',
          lsuffix = '_A', rsuffix = '_B').head()

,age,HP_A,HP_B
0,63,142,100
1,33,47,200


In [78]:
pd.merge(toy1, toy2, left_on='age', right_on='age')

,age,HP_x,HP_y
0,63,142,100
1,33,47,200


### .merge()

In [80]:
ds_chars = pd.read_csv('ds_chars.csv', index_col = 0)

In [81]:
states = pd.read_csv('states.csv', index_col = 0)

In [82]:
ds_chars.merge(states, left_on='home_state', right_on = 'state',
               how = 'inner')

,name,HP,home_state,state,nickname,capital
0,greg,200,WA,WA,evergreen,Olympia
1,miles,200,WA,WA,evergreen,Olympia
2,alan,170,TX,TX,alamo,Austin
3,rachel,200,TX,TX,alamo,Austin
4,alison,300,DC,DC,district,Washington
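
The `how` argument decides what happens to rows without a match. The sketch below uses two small made-up frames (`people` and `places` are hypothetical stand-ins, not the contents of the CSV files) to contrast an inner merge with a left merge:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
people = pd.DataFrame({"name": ["greg", "alan", "kim"],
                       "home_state": ["WA", "TX", "NY"]})
places = pd.DataFrame({"state": ["WA", "TX", "OH"],
                       "capital": ["Olympia", "Austin", "Columbus"]})

# inner: keep only rows whose key appears in both frames.
inner = people.merge(places, left_on="home_state", right_on="state", how="inner")

# left: keep every row of the left frame; unmatched rows get NaN on the right.
left = people.merge(places, left_on="home_state", right_on="state", how="left")

print(inner)
print(left)
```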


### pd.concat()

$\bf{\rightarrow Exercise: Look\ up\ the\ documentation\ on\ pd.concat}$ (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) $\bf{and\ use\ it\ to\ concatenate\ ds\_chars\ and\ states.}$
<br/>
$\bf{Your\ result\ should\ still\ have\ only\ five\ rows!}$

In [86]:
df = pd.concat([ds_chars, states], axis=1)
df

,name,HP,home_state,state,nickname,capital
0,greg,200,WA,WA,evergreen,Olympia
1,miles,200,WA,TX,alamo,Austin
2,alan,170,TX,DC,district,Washington
3,alison,300,DC,OH,buckeye,Columbus
4,rachel,200,TX,OR,beaver,Salem
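
Note that with `axis=1`, `pd.concat` lines rows up by index label only, not by any key column; that is why miles (home state WA) ends up beside TX in the output above. A small sketch on made-up frames shows the alignment, including what happens when the indexes only partly overlap:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
left = pd.DataFrame({"x": [1, 2]}, index=[0, 1])
right = pd.DataFrame({"y": [10, 20]}, index=[1, 2])

# axis=1: place frames side by side, matching on index labels.
side_by_side = pd.concat([left, right], axis=1)

# axis=0: stack frames on top of each other.
stacked = pd.concat([left, left], axis=0, ignore_index=True)

print(side_by_side)
print(stacked)
```

When rows should be matched on a shared column rather than on position, `.merge()` is the more suitable tool.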


### pd.melt()

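Melting reshapes a DataFrame from wide to long format: the columns named in ```id_vars``` are kept as identifiers, and the columns named in ```value_vars``` are unpivoted into a 'variable' column (the old column name) and a 'value' column (the old cell contents).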

In [138]:
pd.melt(df, id_vars='name', value_vars='state', value_name='state_abbreviation')

,name,variable,state_abbreviation
0,greg,state,WA
1,miles,state,TX
2,alan,state,DC
3,alison,state,OH
4,rachel,state,OR
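
Melting is more visible when several columns are unpivoted at once. A minimal sketch, assuming a made-up frame with one identifier column and two measurement columns:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
wide = pd.DataFrame({"name": ["greg", "alan"],
                     "math": [90, 70],
                     "art": [80, 85]})

# Each (name, subject) pair becomes its own row.
long = pd.melt(wide, id_vars="name", value_vars=["math", "art"],
               var_name="subject", value_name="score")

print(long)
```

The result has one row per original cell in the melted columns, so two names and two subjects give four rows.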


# Pair Programming:
    
For these exercises, we will practice pair programming.
For each exercise, choose who will code and who will supervise.
That is, one person types while the other suggests which direction to head in.

# Exercise 1

1. Make a new column which is the log of the cholesterol column.
2. Make another new column which raises e to the value of the new log column.
3. Check that the original column is equal to the second new column (up to floating-point error).

# Exercise 2

1. Split target off of the dataset.
2. Use numpy to create a random subset of the target variables.
3. Match the indices of each set to the indices of the features in order to make two corresponding feature sets.



# Exercise 3
1. Define a function which groups age into year groups of a size of your choosing.
2. Create a new column of binned ages.
3. Drop the original column.


# Exercise 4

1. Use numpy to create a random column of 0s and 1s.
2. Count the number of rows whose target column and new column have the same values.