# Data Mining, an introduction to the Pandas package 
This is a companion notebook for video content presented as part of the Data Mining course at SMU.

In this tutorial we will be looking at a number of different parts of the Pandas package for data analysis, including:
- Data Frames
 - loading data
 - head and tail commands
- Munging
 - indexing operations
 - basic statistics
 - encoding
 - imputation (optional)
- bonus: calling R with magics

## Data Frames
Data frames in Pandas are essentially tables of data that support many of the operations you would run against a relational database. There are many built-in methods for aggregation and visualization, but we will cover those next time.
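
As a minimal sketch of the table analogy, a DataFrame can be built directly from a dictionary of columns. The values and the name `toy` are made up for illustration and are not taken from the heart disease file.

```python
import pandas as pd

# Hypothetical values; each dictionary key becomes a named column.
toy = pd.DataFrame({
    'age': [63, 67, 41],
    'site': ['cleve', 'cleve', 'cleve'],
})

print(toy.shape)          # (number of rows, number of columns)
print(list(toy.columns))  # column names, in insertion order
```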

First, let's load a typical table of data from a csv file. You can download the file from here:
https://archive.ics.uci.edu/ml/datasets/Heart+Disease

Make sure to place it in this directory or adjust the path for the file.
### Reading Data from CSV with Pandas

In [1]:
# let's print out the first five rows inside a csv file

# NOTE: you may need to change the path to the file, 
#       depending on where you saved the data
with open('data/heart_disease.csv') as fid:
    for idx, row in enumerate(fid):
        print(row, end='')  # each row already ends with a newline
        if idx >= 4:
            break

In [2]:
# now let's read in the same data using pandas to save it as a dataframe
import pandas as pd

df = pd.read_csv('data/heart_disease.csv') # read in the csv file

In [3]:
# now let's look at the data
df.head()

Unnamed: 0,site,age,is_male,chest_pain,rest_blood_press,cholesterol,high_blood_sugar,rest_ecg,max_heart_rate,exer_angina,ST_depression,Peak_ST_seg,major_vessels,thal,has_heart_disease
0,cleve,63,1,1,145,233,1,2,150,0,2.3,3,0,6,0
1,cleve,67,1,4,160,286,0,2,108,1,1.5,2,3,3,2
2,cleve,67,1,4,120,229,0,2,129,1,2.6,2,2,7,1
3,cleve,37,1,3,130,250,0,0,187,0,3.5,3,0,3,0
4,cleve,41,0,2,130,204,0,2,172,0,1.4,1,0,3,0


In [4]:
# now let's get a summary of the variables
# (info() prints its report and returns None, which print() then echoes)
print(df.info())
# we can see that most of the data 
#  is saved as an integer or as a nominal object

<class 'pandas.core.frame.DataFrame'>
Int64Index: 920 entries, 0 to 919
Data columns (total 15 columns):
site                 920 non-null object
age                  920 non-null int64
is_male              920 non-null int64
chest_pain           920 non-null int64
rest_blood_press     920 non-null object
cholesterol          920 non-null object
high_blood_sugar     920 non-null object
rest_ecg             920 non-null object
max_heart_rate       920 non-null object
exer_angina          920 non-null object
ST_depression        920 non-null object
Peak_ST_seg          920 non-null object
major_vessels        920 non-null object
thal                 920 non-null object
has_heart_disease    920 non-null int64
dtypes: int64(4), object(11)None


This data has been read into working memory and is known as a DataFrame.
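
Many columns come back as `object` because the file marks missing entries with `'?'`, which a later cell replaces by hand. A hedged sketch of one alternative is to tell `read_csv` about the placeholder through its `na_values` parameter. The three-row sample and the names `csv_text`, `raw`, and `clean` are hypothetical stand-ins for the real file.

```python
import io
import pandas as pd

# Hypothetical sample in the same style as the heart disease file.
csv_text = """age,cholesterol,thal
63,233,6
67,?,3
41,204,?
"""

# Without na_values, the '?' placeholder forces a non-numeric (text) dtype.
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.dtypes)

# With na_values='?', the placeholder is parsed as NaN and the columns stay numeric.
clean = pd.read_csv(io.StringIO(csv_text), na_values='?')
print(clean.dtypes)
```

Columns that contain NaN are stored as floats, so integer-coded variables would still need a decision about how to treat the missing entries.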

### Reading Data from SQLite3 with Pandas
We can also connect to a sqlite3 database using the built-in sqlite3 package that ships with Python.

In [5]:
# but csv files are not the only thing we can work with
# what if the data was actually in a sqlite database?
del df
import sqlite3

con = sqlite3.connect('data/heart_disease_sql') # again this file is in the same directory
df = pd.read_sql('SELECT * FROM heart_disease', con)  # the table name is heart_disease
df.head()

Unnamed: 0,site,age,is_male,chest_pain,rest_blood_press,cholesterol,high_blood_sugar,rest_ecg,max_heart_rate,exer_angina,ST_depression,Peak_ST_seg,major_vessels,thal,has_heart_disease
0,cleve,63,1,1,145,233,1,2,150,0,2.3,3,0,6,0
1,cleve,67,1,4,160,286,0,2,108,1,1.5,2,3,3,2
2,cleve,67,1,4,120,229,0,2,129,1,2.6,2,2,7,1
3,cleve,37,1,3,130,250,0,0,187,0,3.5,3,0,3,0
4,cleve,41,0,2,130,204,0,2,172,0,1.4,1,0,3,0


In [6]:
df.info()
# notice now, however, that the data types are all objects!

<class 'pandas.core.frame.DataFrame'>
Int64Index: 920 entries, 0 to 919
Data columns (total 15 columns):
site                 920 non-null object
age                  920 non-null object
is_male              920 non-null object
chest_pain           920 non-null object
rest_blood_press     920 non-null object
cholesterol          920 non-null object
high_blood_sugar     920 non-null object
rest_ecg             920 non-null object
max_heart_rate       920 non-null object
exer_angina          920 non-null object
ST_depression        920 non-null object
Peak_ST_seg          920 non-null object
major_vessels        920 non-null object
thal                 920 non-null object
has_heart_disease    920 non-null object
dtypes: object(15)
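
One plausible reason for the all-`object` result, which this notebook does not confirm, is that the values are stored as text in the database. The sketch below reproduces that situation with an in-memory database. The table name `heart_demo` and its two rows are hypothetical.

```python
import sqlite3
import pandas as pd

# Build a throwaway in-memory database whose columns hold text values.
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE heart_demo (age TEXT, cholesterol TEXT)')
con.executemany('INSERT INTO heart_demo VALUES (?, ?)',
                [('63', '233'), ('67', '286')])
con.commit()

demo = pd.read_sql('SELECT * FROM heart_demo', con)
con.close()
print(demo.dtypes)   # text columns are not numeric

# astype accepts a {column: dtype} mapping to convert several columns at once.
typed = demo.astype({'age': 'int64', 'cholesterol': 'float64'})
print(typed.dtypes)
```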

### Working with DataFrames
We can index into a DataFrame in a number of ways:

In [7]:
# the variable names are embedded into the structure
print(df.age)
print(df['age']) # but can also be accessed using strings

0     63
1     67
2     67
3     37
4     41
5     56
6     62
7     57
8     63
9     53
10    57
11    56
12    56
13    44
14    52
...
905    41
906    43
907    44
908    47
909    47
910    49
911    49
912    50
913    50
914    52
915    52
916    54
917    56
918    58
919    65
Name: age, Length: 920, dtype: object
0     63
1     67
2     67
3     37
4     41
5     56
6     62
7     57
8     63
9     53
10    57
11    56
12    56
13    44
14    52
...
905    41
906    43
907    44
908    47
909    47
910    49
911    49
912    50
913    50
914    52
915    52
916    54
917    56
918    58
919    65
Name: age, Length: 920, dtype: object
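
Beyond attribute and string access, a few other common indexing forms can be sketched on a small made-up frame. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({'age': [63, 67, 41], 'chest_pain': [1, 4, 2]})

ages_by_attr = toy.age                 # attribute access
ages_by_key = toy['age']               # string key, same column
pair = toy[['age', 'chest_pain']]      # list of keys returns a DataFrame
first_row = toy.iloc[0]                # position-based row access
one_cell = toy.loc[2, 'age']           # label-based row and column access

print(ages_by_attr.equals(ages_by_key))
print(first_row)
print(one_cell)
```

String keys are the safer habit, since attribute access fails for column names that contain spaces or clash with DataFrame method names.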


In [8]:
# the column is still stored as object, so these statistics are not meaningful yet
print(df.chest_pain.min(), df.chest_pain.max(), df.chest_pain.mean())

1 4 inf
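
The odd `inf` mean is a hint that the column is still stored as text rather than as numbers. A small sketch with made-up values shows how text and numeric comparisons differ, and how `pd.to_numeric` fixes it.

```python
import pandas as pd

as_text = pd.Series(['9', '10', '120'])   # hypothetical values stored as text
as_num = pd.to_numeric(as_text)           # same values as integers

text_min = as_text.min()   # text compares character by character
num_min = as_num.min()     # numbers compare by magnitude
num_mean = as_num.mean()

print(repr(text_min), num_min, num_mean)
```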


In [9]:
# let's get rid of the 'site' variable
if 'site' in df:
    del df['site']

print(df.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 920 entries, 0 to 919
Data columns (total 14 columns):
age                  920 non-null object
is_male              920 non-null object
chest_pain           920 non-null object
rest_blood_press     920 non-null object
cholesterol          920 non-null object
high_blood_sugar     920 non-null object
rest_ecg             920 non-null object
max_heart_rate       920 non-null object
exer_angina          920 non-null object
ST_depression        920 non-null object
Peak_ST_seg          920 non-null object
major_vessels        920 non-null object
thal                 920 non-null object
has_heart_disease    920 non-null object
dtypes: object(14)None


In [10]:
# Notice that all of the data is stored as a non-null object
# That's not good. It means we need to change those data types
# in order to encode the variables properly. Right now Pandas
# thinks all of our variables are nominal!

import numpy as np
# replace '?' with -1, we will deal with missing values later
df = df.replace(to_replace='?',value=-1) 

# let's start by first changing the numeric values to be floats
continuous_features = ['rest_blood_press', 'cholesterol', 
                       'max_heart_rate', 'ST_depression']

# and the ordinal values to be integers
ordinal_features = ['age','major_vessels','chest_pain',
                    'rest_ecg','Peak_ST_seg','thal','has_heart_disease']

# we won't touch these variables, keep them as categorical
categ_features = ['is_male','high_blood_sugar','exer_angina']

# use the "astype" function to change the variable type
df[continuous_features] = df[continuous_features].astype(np.float64)
df[ordinal_features] = df[ordinal_features].astype(np.int64)

df.info() # now our data looks better!!

<class 'pandas.core.frame.DataFrame'>
Int64Index: 920 entries, 0 to 919
Data columns (total 14 columns):
age                  920 non-null int64
is_male              920 non-null object
chest_pain           920 non-null int64
rest_blood_press     920 non-null float64
cholesterol          920 non-null float64
high_blood_sugar     920 non-null object
rest_ecg             920 non-null int64
max_heart_rate       920 non-null float64
exer_angina          920 non-null object
ST_depression        920 non-null float64
Peak_ST_seg          920 non-null int64
major_vessels        920 non-null int64
thal                 920 non-null int64
has_heart_disease    920 non-null int64
dtypes: float64(4), int64(7), object(3)
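
A sentinel such as -1 can collide with genuine values in columns that allow negatives (the summary tables in this notebook show a minimum of -2.6 for `ST_depression`). As a hedged alternative sketch, `pd.to_numeric` with `errors='coerce'` turns anything unparseable, including `'?'`, straight into NaN. The frame `toy` is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'cholesterol': ['233', '?', '204'],
    'thal': ['6', '3', '?'],
})

# apply() forwards errors='coerce' to pd.to_numeric for every column.
numeric = toy.apply(pd.to_numeric, errors='coerce')

print(numeric)
print(numeric.dtypes)
```

The trade-off is that coerced columns become floats, so integer codes need a later cast once the missing values have been handled.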

In [11]:
df.head()

Unnamed: 0,age,is_male,chest_pain,rest_blood_press,cholesterol,high_blood_sugar,rest_ecg,max_heart_rate,exer_angina,ST_depression,Peak_ST_seg,major_vessels,thal,has_heart_disease
0,63,1,1,145,233,1,2,150,0,2.3,3,0,6,0
1,67,1,4,160,286,0,2,108,1,1.5,2,3,3,2
2,67,1,4,120,229,0,2,129,1,2.6,2,2,7,1
3,37,1,3,130,250,0,0,187,0,3.5,3,0,3,0
4,41,0,2,130,204,0,2,172,0,1.4,1,0,3,0


Let's get a summary of all attributes in the frame.

In [12]:
df.describe() # summarizes the numeric columns; the object columns are left out

Unnamed: 0,age,chest_pain,rest_blood_press,cholesterol,rest_ecg,max_heart_rate,ST_depression,Peak_ST_seg,major_vessels,thal,has_heart_disease
count,920.0,920.0,920.0,920.0,920.0,920.0,920.0,920.0,920.0,920.0,920.0
mean,53.51087,3.25,123.594565,192.604348,0.601087,129.263043,0.752174,0.840217,-0.436957,1.871739,0.995652
std,9.424685,0.930969,37.484705,114.615011,0.808415,41.376773,1.154353,1.403211,0.959656,3.313649,1.142693
min,28.0,1.0,-1.0,-1.0,-1.0,-1.0,-2.6,-1.0,-1.0,-1.0,0.0
25%,47.0,3.0,120.0,164.0,0.0,115.0,0.0,-1.0,-1.0,-1.0,0.0
50%,54.0,4.0,130.0,221.0,0.0,138.0,0.2,1.0,-1.0,-1.0,1.0
75%,60.0,4.0,140.0,267.0,1.0,156.0,1.5,2.0,0.0,6.0,2.0
max,77.0,4.0,200.0,603.0,2.0,202.0,6.2,3.0,3.0,7.0,4.0


There are 920 entries in this data frame. Notice that this data frame has a number of missing values denoted by the value -1 (that we changed the '?' value to before). We need to either remove the missing values from the dataset OR fill them in with our best guess for those values. Let's first mark them as NaN, so that Pandas treats them as missing, and see how much data is affected.

In [13]:
# how many values are -1 (which we set as the marker for missing values)?
import numpy as np

# let's set those values to NaN, so that Pandas understands they are missing
df = df.replace(to_replace=-1,value=np.nan) # replace -1 with NaN (not a number)
print(df.info())
df.describe() # scroll over to see the values

<class 'pandas.core.frame.DataFrame'>
Int64Index: 920 entries, 0 to 919
Data columns (total 14 columns):
age                  920 non-null int64
is_male              920 non-null object
chest_pain           920 non-null int64
rest_blood_press     861 non-null float64
cholesterol          890 non-null float64
high_blood_sugar     830 non-null object
rest_ecg             918 non-null float64
max_heart_rate       865 non-null float64
exer_angina          865 non-null object
ST_depression        856 non-null float64
Peak_ST_seg          611 non-null float64
major_vessels        309 non-null float64
thal                 434 non-null float64
has_heart_disease    920 non-null int64
dtypes: float64(8), int64(3), object(3)None


Unnamed: 0,age,chest_pain,rest_blood_press,cholesterol,rest_ecg,max_heart_rate,ST_depression,Peak_ST_seg,major_vessels,thal,has_heart_disease
count,920.0,920.0,861.0,890.0,918.0,865.0,856.0,611.0,309.0,434.0,920.0
mean,53.51087,3.25,132.132404,199.130337,0.604575,137.545665,0.883178,1.770867,0.676375,5.087558,0.995652
std,9.424685,0.930969,19.06607,110.78081,0.805827,25.926276,1.088707,0.619256,0.935653,1.919075,1.142693
min,28.0,1.0,0.0,0.0,0.0,60.0,-2.6,1.0,0.0,3.0,0.0
25%,47.0,3.0,120.0,175.0,0.0,120.0,0.0,1.0,0.0,3.0,0.0
50%,54.0,4.0,130.0,223.0,0.0,140.0,0.5,2.0,0.0,6.0,1.0
75%,60.0,4.0,140.0,268.0,1.0,157.0,1.5,2.0,1.0,7.0,2.0
max,77.0,4.0,200.0,603.0,2.0,202.0,6.2,3.0,3.0,7.0,4.0


Wow. Notice how the count of non-missing values went down for many attributes in the description. Looks like we need to impute values. If we drop the rows with missing data, we will be throwing away almost 80% of the data collected. No way!!
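
The cost of dropping rows can be checked directly before deciding. The sketch below uses a small hypothetical frame, not the real data, to count missing values per column and measure how many rows `dropna` would discard.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'age': [63, 67, 41, 56],
    'thal': [6.0, np.nan, 3.0, np.nan],
    'major_vessels': [np.nan, 3.0, 0.0, np.nan],
})

missing_per_column = toy.isna().sum()   # NaN count in each column
complete_rows = toy.dropna()            # keeps only rows with no NaN at all
fraction_lost = 1 - len(complete_rows) / len(toy)

print(missing_per_column)
print(fraction_lost)
```

A row is dropped if any single column is missing, so the loss can be much larger than the missing rate of any one column.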

### Imputation of NaN values (Optional)

In [14]:
# let's look at some stats of the data
df.median() # only calculates for numeric data

age                   54.0
is_male                1.0
chest_pain             4.0
rest_blood_press     130.0
cholesterol          223.0
high_blood_sugar       0.0
rest_ecg               0.0
max_heart_rate       140.0
exer_angina            0.0
ST_depression          0.5
Peak_ST_seg            2.0
major_vessels          0.0
thal                   6.0
has_heart_disease      1.0
dtype: float64

In [15]:
# the 'fillna' function will take the given series (the output above)
# and fill in the missing values for the columns it has
df_imputed = df.fillna(df.median()) # note that to do this all values must be numeric
df_imputed.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 920 entries, 0 to 919
Data columns (total 14 columns):
age                  920 non-null int64
is_male              920 non-null object
chest_pain           920 non-null int64
rest_blood_press     920 non-null float64
cholesterol          920 non-null float64
high_blood_sugar     920 non-null object
rest_ecg             920 non-null float64
max_heart_rate       920 non-null float64
exer_angina          920 non-null object
ST_depression        920 non-null float64
Peak_ST_seg          920 non-null float64
major_vessels        920 non-null float64
thal                 920 non-null float64
has_heart_disease    920 non-null int64
dtypes: float64(8), int64(3), object(3)

Notice that the object variables are unchanged, but all the numeric/ordinal values have been filled in with the median of the columns. Let's try something (slightly) smarter, and fill in the ordinals with the median and the continuous with the mean.

In [16]:
# make one series for imputing with
series_mean = df[continuous_features].mean()
series_median = df[categ_features+ordinal_features].median()
cat_series = pd.concat((series_median,series_mean))

print(cat_series)

is_male                1.000000
high_blood_sugar       0.000000
exer_angina            0.000000
age                   54.000000
major_vessels          0.000000
chest_pain             4.000000
rest_ecg               0.000000
Peak_ST_seg            2.000000
thal                   6.000000
has_heart_disease      1.000000
rest_blood_press     132.132404
cholesterol          199.130337
max_heart_rate       137.545665
ST_depression          0.883178
dtype: float64


In [17]:
# now let's impute the numbers a bit differently

df_imputed = df.fillna(value=cat_series)
df_imputed.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 920 entries, 0 to 919
Data columns (total 14 columns):
age                  920 non-null int64
is_male              920 non-null object
chest_pain           920 non-null int64
rest_blood_press     920 non-null float64
cholesterol          920 non-null float64
high_blood_sugar     920 non-null object
rest_ecg             920 non-null float64
max_heart_rate       920 non-null float64
exer_angina          920 non-null object
ST_depression        920 non-null float64
Peak_ST_seg          920 non-null float64
major_vessels        920 non-null float64
thal                 920 non-null float64
has_heart_disease    920 non-null int64
dtypes: float64(8), int64(3), object(3)
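
To see what `fillna` does with a Series, here is the same idea on a small hypothetical frame: each NaN is replaced by the value whose index label matches its column name.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'thal': [3.0, np.nan, 7.0, 7.0],                # treated as ordinal
    'cholesterol': [200.0, 250.0, np.nan, 300.0],   # treated as continuous
})

# Median for the ordinal column, mean for the continuous one.
fill_values = pd.concat((toy[['thal']].median(), toy[['cholesterol']].mean()))
filled = toy.fillna(value=fill_values)

print(fill_values)
print(filled)
```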

In [18]:
df_imputed[categ_features].describe()

Unnamed: 0,is_male,high_blood_sugar,exer_angina
count,920,920,920
unique,2,3,3
top,1,0,0
freq,726,692,528
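
Three unique values for a binary variable is suspicious. One possible cause, which this notebook does not verify, is that the imputed entries were written as the float `0.0` while the original entries are stored as text. The sketch below reproduces that on made-up data and shows how casting to a single type merges the levels.

```python
import pandas as pd

# Hypothetical column mixing text levels with one float fill value.
levels = pd.Series(['0', '1', 0.0, '0'], dtype=object)

counts = levels.value_counts()   # '0' and 0.0 are counted separately
n_mixed = levels.nunique()

# Casting everything to one numeric type merges the duplicate level.
cleaned = levels.astype(float).astype(int)
n_clean = cleaned.nunique()

print(counts)
print(n_mixed, n_clean)
```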


### Indexing logically into Data Frames
Let's now say that we are only interested in the summary of the dataframe for patients without heart disease (has_heart_disease equal to 0). We can achieve this using a single line of code:

In [19]:
df_imputed[df_imputed.has_heart_disease==0].describe()

Unnamed: 0,age,chest_pain,rest_blood_press,cholesterol,rest_ecg,max_heart_rate,ST_depression,Peak_ST_seg,major_vessels,thal,has_heart_disease
count,411.0,411.0,411.0,411.0,411.0,411.0,411.0,411.0,411.0,411.0,411
mean,50.547445,2.761557,130.021042,226.575368,0.547445,148.25283,0.441963,1.729927,0.111922,5.085158,0
std,9.4337,0.903425,16.460208,74.301504,0.805204,23.152969,0.704565,0.515662,0.427276,1.510951,0
min,28.0,1.0,80.0,0.0,0.0,69.0,-1.1,1.0,0.0,3.0,0
25%,43.0,2.0,120.0,199.130337,0.0,135.5,0.0,1.0,0.0,3.0,0
50%,51.0,3.0,130.0,225.0,0.0,150.0,0.0,2.0,0.0,6.0,0
75%,57.0,4.0,140.0,266.0,1.0,165.0,0.883178,2.0,0.0,6.0,0
max,76.0,4.0,190.0,564.0,2.0,202.0,4.2,3.0,3.0,7.0,0
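
The expression inside the brackets is a boolean Series, and several conditions can be combined with `&` and `|` as long as each one is wrapped in parentheses. A minimal sketch on hypothetical data:

```python
import pandas as pd

toy = pd.DataFrame({
    'age': [63, 37, 41, 67],
    'has_heart_disease': [0, 0, 2, 1],
})

# Single condition: rows where the mask is True are kept.
healthy = toy[toy.has_heart_disease == 0]

# Two conditions combined element-wise with &.
older_sick = toy[(toy.has_heart_disease > 0) & (toy.age > 60)]

print(healthy)
print(older_sick)
```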


In [20]:
# or we can use the extremely useful "groupby" function
df_imputed.groupby(by='has_heart_disease').median()

has_heart_disease,age,chest_pain,rest_blood_press,cholesterol,rest_ecg,max_heart_rate,ST_depression,Peak_ST_seg,major_vessels,thal
0,51,3,130.0,225.0,0,150.0,0.0,2,0,6
1,55,4,130.0,226.0,0,130.0,1.0,2,0,6
2,58,4,132.132404,193.0,0,130.0,1.4,2,0,6
3,60,4,132.132404,212.0,1,122.0,1.0,2,0,6
4,59,4,133.066202,218.5,1,126.5,2.45,2,0,6


In [21]:
df_imputed.groupby(by=df_imputed.has_heart_disease>0).mean()

has_heart_disease,age,chest_pain,rest_blood_press,cholesterol,rest_ecg,max_heart_rate,ST_depression,Peak_ST_seg,major_vessels,thal,has_heart_disease
False,50.547445,2.761557,130.021042,226.575368,0.547445,148.25283,0.441963,1.729927,0.111922,5.085158,0.0
True,55.903733,3.644401,133.837257,176.969418,0.64833,128.899997,1.239443,1.943026,0.320236,5.960707,1.799607


In [22]:
df_imputed.groupby(by=df_imputed.major_vessels>2).mean()

major_vessels,age,chest_pain,rest_blood_press,cholesterol,rest_ecg,max_heart_rate,ST_depression,Peak_ST_seg,major_vessels,thal,has_heart_disease
False,53.368889,3.241111,132.026458,197.656567,0.586667,137.612235,0.861359,1.847778,0.165556,5.566667,0.966667
True,59.9,3.65,136.9,265.45,1.35,134.55,1.865,1.85,3.0,5.7,2.3
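
The three `groupby` calls above differ only in the grouping key: a column name gives one group per distinct value, while a boolean expression gives a False group and a True group. A small sketch on hypothetical data makes the difference easy to check by hand.

```python
import pandas as pd

toy = pd.DataFrame({
    'age': [63, 37, 41, 67],
    'has_heart_disease': [0, 0, 2, 1],
})

# One group per distinct value of the column.
by_level = toy.groupby('has_heart_disease')['age'].median()

# Two groups: rows where the condition is False and rows where it is True.
by_flag = toy.groupby(toy.has_heart_disease > 0)['age'].mean()

print(by_level)
print(by_flag)
```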


### One Hot Encoding of Categorical Variables

In [23]:
# one hot encoded variables can be created using the get_dummies function
tmpdf = pd.get_dummies(df_imputed['chest_pain'],prefix='chest')

tmpdf.head()

Unnamed: 0,chest_1,chest_2,chest_3,chest_4
0,1,0,0,0
1,0,0,0,1
2,0,0,0,1
3,0,0,1,0
4,0,1,0,0


In [24]:
# one hot encoding of ALL categorical variables
# there is a lot going on in this one line of code, so let's step through it

# pd.concat([*], axis=1) // this line of code concatenates all the data frames in the [*] list side by side
# [** for col in categ_features] // this steps through each feature in categ_features and 
#                                //   creates a new element in a list based on the output of **
# pd.get_dummies(df_imputed[col],prefix=col) // this creates a one hot encoded dataframe of the variable=col (like code above)

one_hot_df = pd.concat([pd.get_dummies(df_imputed[col],prefix=col) for col in categ_features], axis=1)

one_hot_df.head()

Unnamed: 0,is_male_0,is_male_1,high_blood_sugar_0.0,high_blood_sugar_0,high_blood_sugar_1,exer_angina_0.0,exer_angina_0,exer_angina_1
0,0,1,0,0,1,0,1,0
1,0,1,0,1,0,0,0,1
2,0,1,0,1,0,0,0,1
3,0,1,0,1,0,0,1,0
4,1,0,0,1,0,0,1,0
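
As a sketch of an alternative to the list comprehension, `get_dummies` also accepts a whole DataFrame together with a `columns` argument. It encodes only the named columns and keeps the rest in the result. The frame `toy` is hypothetical and uses clean integer codes.

```python
import pandas as pd

toy = pd.DataFrame({
    'age': [63, 41, 67],
    'is_male': [1, 0, 1],
    'exer_angina': [0, 0, 1],
})

# Encode the two categorical columns; 'age' passes through untouched.
encoded = pd.get_dummies(toy, columns=['is_male', 'exer_angina'])

print(list(encoded.columns))
print(encoded)
```

Unlike the `pd.concat` version, the result still contains the non-categorical columns, so it does not need to be joined back to the original frame.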


## Calling R from IPython

- Note: you will need R installed on your machine to run these!!

IPython has a lot of interesting "magics" built in. If you use R and have it installed on your machine, then you can write and run R code directly from IPython cells. R also uses data frames, which we can push data into directly from the Pandas object we are using:

In [25]:
# CONVERT PANDAS DATAFRAME TO R DATA.FRAME
# adapted from: http://tagteam.harvard.edu/hub_feeds/1981/feed_items/196017
# the %R and %%R magics come from rpy2; the older `rmagic` extension
# was deprecated in favor of it, so only this call is needed on current installs

%load_ext rpy2.ipython

df_colnames = df_imputed.columns

http://rpy.sourceforge.net/


In [26]:
df_colnames

Index([u'age', u'is_male', u'chest_pain', u'rest_blood_press', u'cholesterol', u'high_blood_sugar', u'rest_ecg', u'max_heart_rate', u'exer_angina', u'ST_depression', u'Peak_ST_seg', u'major_vessels', u'thal', u'has_heart_disease'], dtype='object')

Now let's take the data frame from Pandas and tell the R magics that we want to have variables available in the R workspace. We use the %%R command to tell IPython that the entire cell is R code. The "-i" flag tells the R magics which variables to transfer over to R.

The following code brings the variables df_imputed and df_colnames into the R workspace and tests whether df_imputed is truly saved as an R data.frame.

In [27]:
%%R -i df_imputed,df_colnames 

colnames(df_imputed) <- unlist(df_colnames); 
print(is.data.frame(df_imputed))

[1] TRUE


It is a data.frame! Great. Let's call an R function on the data.frame.

In [28]:
%%R -i df_imputed 
print(summary(df_imputed))

      age        is_male   chest_pain   rest_blood_press  cholesterol   
 Min.   :28.00   0:194   Min.   :1.00   Min.   :  0.0    Min.   :  0.0  
 1st Qu.:47.00   1:726   1st Qu.:3.00   1st Qu.:120.0    1st Qu.:177.8  
 Median :54.00           Median :4.00   Median :130.0    Median :221.0  
 Mean   :53.51           Mean   :3.25   Mean   :132.1    Mean   :199.1  
 3rd Qu.:60.00           3rd Qu.:4.00   3rd Qu.:140.0    3rd Qu.:267.0  
 Max.   :77.00           Max.   :4.00   Max.   :200.0    Max.   :603.0  
 high_blood_sugar    rest_ecg      max_heart_rate  exer_angina
 0  :692          Min.   :0.0000   Min.   : 60.0   0  :528    
 0.0: 90          1st Qu.:0.0000   1st Qu.:120.0   0.0: 55    
 1  :138          Median :0.0000   Median :138.0   1  :337    
                  Mean   :0.6033   Mean   :137.5              
                  3rd Qu.:1.0000   3rd Qu.:156.0              
                  Max.   :2.0000   Max.   :202.0              
 ST_depression      Peak_ST_seg    major_vessels

So we are able to call some R and get console output. Now let's make some changes to the data.frame in R and print the result back in Python.

In [29]:
print('original:', df_imputed.age.head())

# send df_imputed to R, then multiply the age column by two there
# the %R command tells IPython it's just one line of R code
%R -i df_imputed df_imputed$age <- df_imputed$age*2

# now we are back in python, did it change?
print('after manipulation in R:', df_imputed.age.head())

original: 0    63
1    67
2    67
3    37
4    41
Name: age, dtype: int64
after manipulation in R: 0    63
1    67
2    67
3    37
4    41
Name: age, dtype: int64


Well, it looks like the data was not synchronized. So instead let's set up an output variable for the DataFrame that we send into R. `-i df_imputed` means that we are sending in the DataFrame as an R data.frame. `-o df_imputed` means we are also getting the same variable and copying it back to the Python workspace.

In [30]:
print('original:', df_imputed.age.head())

# This is the same code as before, but now with an output variable
%R -i df_imputed -o df_imputed  df_imputed$age <- df_imputed$age*2
# you can place the above on any line to make sure that the data stays
# synchronized between Python and R
print('after manipulation in R:', df_imputed.age.head())

original: 0    63
1    67
2    67
3    37
4    41
Name: age, dtype: int64
after manipulation in R: 0    126
1    134
2    134
3     74
4     82
Name: age, dtype: float64


Awesome. So now we can send DataFrames into R, manipulate them, and get them back into the python workspace. Is this memory hogging? Yes. Is it really useful for when you want to connect and work with different parts of R? You betcha.

In [31]:
# We can also create new variables in R and
# have them passed back to us.
# Here I am sending in df_imputed and getting back a data frame
# created in R
%R -i df_imputed -o df_from_R df_from_R <- df_imputed

# notice that the only difference is that the integers are 32 bits
df_from_R.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 920 entries, 0 to 919
Data columns (total 14 columns):
age                  920 non-null float64
is_male              920 non-null object
chest_pain           920 non-null int32
rest_blood_press     920 non-null float64
cholesterol          920 non-null float64
high_blood_sugar     920 non-null object
rest_ecg             920 non-null float64
max_heart_rate       920 non-null float64
exer_angina          920 non-null object
ST_depression        920 non-null float64
Peak_ST_seg          920 non-null float64
major_vessels        920 non-null float64
thal                 920 non-null float64
has_heart_disease    920 non-null int32
dtypes: float64(9), int32(2), object(3)

That's it. Use this as a reference sheet for Pandas, some basic imputation, and calling R code. Thanks!