# An introduction to the Pandas package 
This is a companion notebook for video content presented as part of the Data Mining course at SMU.

In this tutorial we will be looking at a number of different parts of the Pandas package for data analysis, including:
- Data Frames
 - loading data
 - head and tail commands
- Munging
 - indexing operations
 - basic statistics
 - encoding
 - imputation (optional, but recommended)


## Data Frames
Data frames in Pandas are essentially tables of data that support many relational-database-style operations, such as selecting, filtering, and grouping rows. There are many built-in methods for aggregation and visualization, but we will cover those next time.
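
To make the table analogy concrete, here is a minimal sketch that builds a small DataFrame by hand. The column names and values are made up for illustration and are not the course data.

```python
import pandas as pd

# a tiny, hand-made table (illustrative values only)
toy = pd.DataFrame({
    'age': [63, 67, 37],
    'is_male': [1, 1, 0],
    'cholesterol': [233.0, 286.0, 250.0],
})

print(toy.shape)           # (number of rows, number of columns)
print(toy.dtypes)          # each column carries its own data type
print(toy[toy.age > 60])   # row selection, similar to SQL's WHERE age > 60
```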

First, let's load a typical table of data from a CSV file. You can download the file from here:
https://archive.ics.uci.edu/ml/datasets/Heart+Disease

Make sure to place it in this directory or adjust the path for the file.
### Reading Data from CSV with Pandas

In [3]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
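
The two `set_option` calls above change the display settings for the whole session. If you only want the wider display for one printout, a context manager is an alternative. This is a small sketch on a made-up frame, not part of the original notebook.

```python
import pandas as pd

default_rows = pd.get_option('display.max_rows')

# options set inside the 'with' block are restored when the block exits
with pd.option_context('display.max_rows', 5, 'display.max_columns', None):
    rows_inside = pd.get_option('display.max_rows')
    print(pd.DataFrame({'x': range(20)}))  # printed in truncated form

rows_after = pd.get_option('display.max_rows')
print(rows_inside, rows_after)
```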

In [1]:
# let's print out the first five rows inside a csv file

# NOTE: you may need to change the path to the file, 
#       depending on where you saved the data
with open('data/heart_disease.csv') as fid:
    for idx, row in enumerate(fid):
        print(row,end='')
        if idx >= 4:
            break

site,age,is_male,chest_pain,rest_blood_press,cholesterol,high_blood_sugar,rest_ecg,max_heart_rate,exer_angina,ST_depression,Peak_ST_seg,major_vessels,thal,has_heart_disease
cleve,63,1,1,145,233,1,2,150,0,2.3,3,0,6,0
cleve,67,1,4,160,286,0,2,108,1,1.5,2,3,3,2
cleve,67,1,4,120,229,0,2,129,1,2.6,2,2,7,1
cleve,37,1,3,130,250,0,0,187,0,3.5,3,0,3,0


In [2]:
# now let's read in the same data using pandas to save it as a dataframe
import pandas as pd

df = pd.read_csv('data/heart_disease.csv') # read in the csv file

In [4]:
# now let's look at the data
df.head(100)

Unnamed: 0,site,age,is_male,chest_pain,rest_blood_press,cholesterol,high_blood_sugar,rest_ecg,max_heart_rate,exer_angina,ST_depression,Peak_ST_seg,major_vessels,thal,has_heart_disease
0,cleve,63,1,1,145,233,1,2,150,0,2.3,3,0,6,0
1,cleve,67,1,4,160,286,0,2,108,1,1.5,2,3,3,2
2,cleve,67,1,4,120,229,0,2,129,1,2.6,2,2,7,1
3,cleve,37,1,3,130,250,0,0,187,0,3.5,3,0,3,0
4,cleve,41,0,2,130,204,0,2,172,0,1.4,1,0,3,0
5,cleve,56,1,2,120,236,0,0,178,0,0.8,1,0,3,0
6,cleve,62,0,4,140,268,0,2,160,0,3.6,3,2,3,3
7,cleve,57,0,4,120,354,0,0,163,1,0.6,1,0,3,0
8,cleve,63,1,4,130,254,0,2,147,0,1.4,2,1,7,2
9,cleve,53,1,4,140,203,1,2,155,1,3.1,3,0,7,1


In [4]:
# now let's get a summary of the variables
print(df.info())
# we can see that most of the data 
#  is saved as an integer or as a nominal object

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 920 entries, 0 to 919
Data columns (total 15 columns):
site                 920 non-null object
age                  920 non-null int64
is_male              920 non-null int64
chest_pain           920 non-null int64
rest_blood_press     920 non-null object
cholesterol          920 non-null object
high_blood_sugar     920 non-null object
rest_ecg             920 non-null object
max_heart_rate       920 non-null object
exer_angina          920 non-null object
ST_depression        920 non-null object
Peak_ST_seg          920 non-null object
major_vessels        920 non-null object
thal                 920 non-null object
has_heart_disease    920 non-null int64
dtypes: int64(4), object(11)
memory usage: 107.9+ KB
None


This data has been read into working memory and is known as a DataFrame.
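
Many columns above show up as `object` because the file marks missing entries with `'?'`. One option, sketched below on a few illustrative rows rather than the real file, is to tell `read_csv` which marker means "missing" so that the affected columns are parsed as numbers. The inline CSV text and the name `toy` are assumptions made for this example.

```python
import io
import pandas as pd

# a few illustrative rows with a made-up subset of the columns
csv_text = """site,age,rest_blood_press,cholesterol
cleve,63,145,233
swiss,35,?,0
cleve,41,130,?
"""

# treat '?' as a missing value while parsing
toy = pd.read_csv(io.StringIO(csv_text), na_values='?')

print(toy.dtypes)        # columns that contained '?' are parsed as floats
print(toy.isna().sum())  # number of missing values per column
```

The rest of this notebook takes the longer route (replace, then cast) because it also has to handle data that arrives as text from a database.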

### Reading Data from SQLite3 with Pandas
We can also connect to a SQLite database using the built-in `sqlite3` package that ships with Python.

In [21]:
# but csv files are not the only thing we can work with
# what if the data was actually in a sqlite database?
del df
import sqlite3

con = sqlite3.connect('data/heart_disease_sql') # again this file is in the same directory
df = pd.read_sql('SELECT * FROM heart_disease', con)  # the table name is heart_disease
df.head()

Unnamed: 0,site,age,is_male,chest_pain,rest_blood_press,cholesterol,high_blood_sugar,rest_ecg,max_heart_rate,exer_angina,ST_depression,Peak_ST_seg,major_vessels,thal,has_heart_disease
0,cleve,63,1,1,145,233,1,2,150,0,2.3,3,0,6,0
1,cleve,67,1,4,160,286,0,2,108,1,1.5,2,3,3,2
2,cleve,67,1,4,120,229,0,2,129,1,2.6,2,2,7,1
3,cleve,37,1,3,130,250,0,0,187,0,3.5,3,0,3,0
4,cleve,41,0,2,130,204,0,2,172,0,1.4,1,0,3,0


In [22]:
df[df.index==305]

Unnamed: 0,site,age,is_male,chest_pain,rest_blood_press,cholesterol,high_blood_sugar,rest_ecg,max_heart_rate,exer_angina,ST_depression,Peak_ST_seg,major_vessels,thal,has_heart_disease
305,swiss,35,1,4,?,0,?,0,130,1,?,?,?,7,3
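
Selecting a row with `df[df.index==305]` uses a boolean mask. The sketch below, on a made-up three-row frame, shows two equivalent ways to fetch the same row: by label with `.loc` and by position with `.iloc`.

```python
import pandas as pd

toy = pd.DataFrame({'age': [63, 67, 35], 'site': ['cleve', 'cleve', 'swiss']},
                   index=[0, 1, 305])

by_mask = toy[toy.index == 305]   # boolean mask, returns a one-row DataFrame
by_label = toy.loc[[305]]         # label-based lookup, also a one-row DataFrame
by_position = toy.iloc[[2]]       # position-based lookup (third row)

print(by_mask.equals(by_label), by_label.equals(by_position))
```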


In [7]:
df.info()
# notice now, however, that the data types are all objects!

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 920 entries, 0 to 919
Data columns (total 15 columns):
site                 920 non-null object
age                  920 non-null object
is_male              920 non-null object
chest_pain           920 non-null object
rest_blood_press     920 non-null object
cholesterol          920 non-null object
high_blood_sugar     920 non-null object
rest_ecg             920 non-null object
max_heart_rate       920 non-null object
exer_angina          920 non-null object
ST_depression        920 non-null object
Peak_ST_seg          920 non-null object
major_vessels        920 non-null object
thal                 920 non-null object
has_heart_disease    920 non-null object
dtypes: object(15)
memory usage: 107.9+ KB
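
The types you get back from `read_sql` depend on how the values were stored in the database. The sketch below uses a temporary in-memory database and a hypothetical table name (`toy_table`) to show that a column stored as text comes back as strings, which is presumably what happened to the heart disease table.

```python
import sqlite3
import pandas as pd

con = sqlite3.connect(':memory:')  # temporary in-memory database
toy = pd.DataFrame({'age': [63, 35], 'rest_blood_press': ['145', '?']})
toy.to_sql('toy_table', con, index=False)  # hypothetical table name

loaded = pd.read_sql('SELECT * FROM toy_table', con)
con.close()

print(loaded.dtypes)  # the text column comes back as strings, not numbers
```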


### Working with DataFrames
We can index into a DataFrame in a number of ways:

In [6]:
# the variable names are embedded into the structure
print(df.age)
print(df['age']) # but can also be accessed using strings

0      63
1      67
2      67
3      37
4      41
5      56
6      62
7      57
8      63
9      53
10     57
11     56
12     56
13     44
14     52
15     57
16     48
17     54
18     48
19     49
20     64
21     58
22     58
23     58
24     60
25     50
26     58
27     66
28     43
29     40
       ..
890    52
891    53
892    53
893    54
894    55
895    55
896    55
897    56
898    56
899    56
900    58
901    59
902    59
903    65
904    66
905    41
906    43
907    44
908    47
909    47
910    49
911    49
912    50
913    50
914    52
915    52
916    54
917    56
918    58
919    65
Name: age, Length: 920, dtype: object
0      63
1      67
2      67
3      37
4      41
5      56
6      62
7      57
8      63
9      53
10     57
11     56
12     56
13     44
14     52
15     57
16     48
17     54
18     48
19     49
20     64
21     58
22     58
23     58
24     60
25     50
26     58
27     66
28     43
29     40
       ..
890    52
891    53
892    53
893    54
89

In [7]:
print (df.chest_pain.min(), df.chest_pain.max(), df.chest_pain.mean())

1 4 inf
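
The `inf` mean above is a symptom of the column holding strings rather than numbers. A likely fix, sketched here on a made-up series, is `pd.to_numeric` with `errors='coerce'`, which turns anything unparseable (such as `'?'`) into NaN.

```python
import pandas as pd

raw = pd.Series(['1', '4', '3', '?'], dtype=object)  # strings, as read from a text column

# convert to numbers; unparseable entries become NaN
numeric = pd.to_numeric(raw, errors='coerce')

print(numeric.dtype)
print(numeric.min(), numeric.max(), numeric.mean())  # NaN is skipped by default
```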


In [11]:
df.dtypes

site                 object
age                  object
is_male              object
chest_pain           object
rest_blood_press     object
cholesterol          object
high_blood_sugar     object
rest_ecg             object
max_heart_rate       object
exer_angina          object
ST_depression        object
Peak_ST_seg          object
major_vessels        object
thal                 object
has_heart_disease    object
dtype: object

In [5]:
# let's get rid of the 'site' variable
if 'site' in df:
    del df['site']

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 920 entries, 0 to 919
Data columns (total 14 columns):
age                  920 non-null int64
is_male              920 non-null int64
chest_pain           920 non-null int64
rest_blood_press     920 non-null object
cholesterol          920 non-null object
high_blood_sugar     920 non-null object
rest_ecg             920 non-null object
max_heart_rate       920 non-null object
exer_angina          920 non-null object
ST_depression        920 non-null object
Peak_ST_seg          920 non-null object
major_vessels        920 non-null object
thal                 920 non-null object
has_heart_disease    920 non-null int64
dtypes: int64(4), object(10)
memory usage: 100.7+ KB


In [6]:
# Notice that all of the data is stored as a non-null object
# That's not good. It means we need to change those data types
# in order to encode the variables properly. Right now Pandas
# thinks all of our variables are nominal!

import numpy as np
# replace '?' with -1; we will deal with missing values later
df = df.replace(to_replace='?',value=-1) # don't set to NaN yet, because NaN cannot be stored in an integer column

# let's start by first changing the numeric values to be floats
continuous_features = ['rest_blood_press', 'cholesterol', 
                       'max_heart_rate', 'ST_depression']

# and the ordinal values to be integers
ordinal_features = ['age','major_vessels','chest_pain',
                    'rest_ecg','Peak_ST_seg','thal','has_heart_disease']

# we won't touch these variables, keep them as categorical
categ_features = ['is_male','high_blood_sugar','exer_angina']

# use the "astype" function to change the variable type
df[continuous_features] = df[continuous_features].astype(np.float64)
df[ordinal_features] = df[ordinal_features].astype(np.int64)

df.info() # now our data looks better!!

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 920 entries, 0 to 919
Data columns (total 14 columns):
age                  920 non-null int64
is_male              920 non-null int64
chest_pain           920 non-null int64
rest_blood_press     920 non-null float64
cholesterol          920 non-null float64
high_blood_sugar     920 non-null object
rest_ecg             920 non-null int64
max_heart_rate       920 non-null float64
exer_angina          920 non-null object
ST_depression        920 non-null float64
Peak_ST_seg          920 non-null int64
major_vessels        920 non-null int64
thal                 920 non-null int64
has_heart_disease    920 non-null int64
dtypes: float64(4), int64(8), object(2)
memory usage: 100.7+ KB
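
The cell above uses -1 as a placeholder because NaN cannot be cast to a plain integer type. The sketch below demonstrates that limitation on a made-up series and shows the nullable `'Int64'` dtype as one possible alternative. The notebook itself does not use this dtype, so treat it as an aside.

```python
import numpy as np
import pandas as pd

values = pd.Series([3.0, np.nan, 0.0])

try:
    values.astype(np.int64)          # NaN has no plain-integer representation
    cast_failed = False
except (ValueError, TypeError):
    cast_failed = True

nullable = values.astype('Int64')    # pandas nullable integer dtype (capital I)

print(cast_failed)
print(nullable)
```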


In [17]:
df.head()

Unnamed: 0,site,age,is_male,chest_pain,rest_blood_press,cholesterol,high_blood_sugar,rest_ecg,max_heart_rate,exer_angina,ST_depression,Peak_ST_seg,major_vessels,thal,has_heart_disease
0,cleve,63,1,1,145.0,233.0,1,2,150.0,0,2.3,3,0,6,0
1,cleve,67,1,4,160.0,286.0,0,2,108.0,1,1.5,2,3,3,2
2,cleve,67,1,4,120.0,229.0,0,2,129.0,1,2.6,2,2,7,1
3,cleve,37,1,3,130.0,250.0,0,0,187.0,0,3.5,3,0,3,0
4,cleve,41,0,2,130.0,204.0,0,2,172.0,0,1.4,1,0,3,0


Let's get a summary of all attributes in the frame.

In [25]:
df.describe().T # summarizes the numeric columns; .T transposes so each attribute is a row

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
age,920.0,53.51087,9.424685,28.0,47.0,54.0,60.0,77.0
chest_pain,920.0,3.25,0.930969,1.0,3.0,4.0,4.0,4.0
rest_blood_press,920.0,123.594565,37.484705,-1.0,120.0,130.0,140.0,200.0
cholesterol,920.0,192.604348,114.615011,-1.0,164.0,221.0,267.0,603.0
rest_ecg,920.0,0.601087,0.808415,-1.0,0.0,0.0,1.0,2.0
max_heart_rate,920.0,129.263043,41.376773,-1.0,115.0,138.0,156.0,202.0
ST_depression,920.0,0.752174,1.154353,-2.6,0.0,0.2,1.5,6.2
Peak_ST_seg,920.0,0.840217,1.403211,-1.0,-1.0,1.0,2.0,3.0
major_vessels,920.0,-0.436957,0.959656,-1.0,-1.0,-1.0,0.0,3.0
thal,920.0,1.871739,3.313649,-1.0,-1.0,-1.0,6.0,7.0


There are 920 entries in this data frame. Notice that this data frame has a number of missing values, denoted by the value -1 (which we substituted for the '?' value earlier). We need to either remove the rows with missing values from the dataset or fill them in with our best guess. Let's first check how many values are missing.

In [20]:
len(df[df.rest_blood_press==-1])

59

In [7]:
# how many values have the -1 (which we set as the missing value)?
import numpy as np

# let's set those values to NaN, so that Pandas understands they are missing
df = df.replace(to_replace=-1,value=np.nan) # replace -1 with NaN (not a number)
print (df.info())
df.describe() # scroll over to see the values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 920 entries, 0 to 919
Data columns (total 14 columns):
age                  920 non-null int64
is_male              920 non-null int64
chest_pain           920 non-null int64
rest_blood_press     861 non-null float64
cholesterol          890 non-null float64
high_blood_sugar     830 non-null object
rest_ecg             918 non-null float64
max_heart_rate       865 non-null float64
exer_angina          865 non-null object
ST_depression        856 non-null float64
Peak_ST_seg          611 non-null float64
major_vessels        309 non-null float64
thal                 434 non-null float64
has_heart_disease    920 non-null int64
dtypes: float64(8), int64(4), object(2)
memory usage: 100.7+ KB
None


Unnamed: 0,age,is_male,chest_pain,rest_blood_press,cholesterol,rest_ecg,max_heart_rate,ST_depression,Peak_ST_seg,major_vessels,thal,has_heart_disease
count,920.0,920.0,920.0,861.0,890.0,918.0,865.0,856.0,611.0,309.0,434.0,920.0
mean,53.51087,0.78913,3.25,132.132404,199.130337,0.604575,137.545665,0.883178,1.770867,0.676375,5.087558,0.995652
std,9.424685,0.408148,0.930969,19.06607,110.78081,0.805827,25.926276,1.088707,0.619256,0.935653,1.919075,1.142693
min,28.0,0.0,1.0,0.0,0.0,0.0,60.0,-2.6,1.0,0.0,3.0,0.0
25%,47.0,1.0,3.0,120.0,175.0,0.0,120.0,0.0,1.0,0.0,3.0,0.0
50%,54.0,1.0,4.0,130.0,223.0,0.0,140.0,0.5,2.0,0.0,6.0,1.0
75%,60.0,1.0,4.0,140.0,268.0,1.0,157.0,1.5,2.0,1.0,7.0,2.0
max,77.0,1.0,4.0,200.0,603.0,2.0,202.0,6.2,3.0,3.0,7.0,4.0


Wow. Notice how the non-null counts went down for many attributes in the `info` and `describe` output. Looks like we need to impute values. If we drop the rows with missing data, we will be throwing away almost 80% of the data collected. No way!!
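
To see how costly dropping rows can be, the sketch below counts the missing values per column and measures the fraction of rows lost by `dropna` on a small made-up frame. The numbers are illustrative and unrelated to the real dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'age': [63, 67, 37, 41],
    'major_vessels': [0.0, np.nan, np.nan, 2.0],
    'thal': [6.0, 3.0, np.nan, np.nan],
})

missing_per_column = toy.isna().sum()   # count of NaN values in each column
complete_rows = toy.dropna()            # keep only rows with no NaN at all
fraction_lost = 1 - len(complete_rows) / len(toy)

print(missing_per_column)
print(fraction_lost)
```

Running the same three lines on `df` gives the exact fraction of the heart disease data that would be discarded.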

### Imputation of NaN values (Optional)

In [27]:
# let's look at some stats of the data
df.median() # only calculates for numeric data

age                   54.0
is_male                1.0
chest_pain             4.0
rest_blood_press     130.0
cholesterol          223.0
high_blood_sugar       0.0
rest_ecg               0.0
max_heart_rate       140.0
exer_angina            0.0
ST_depression          0.5
Peak_ST_seg            2.0
major_vessels          0.0
thal                   6.0
has_heart_disease      1.0
dtype: float64
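
Depending on your Pandas version, calling `median()` on a frame that still contains string columns may raise an error instead of silently skipping them. A defensive sketch, on a made-up frame, is to pass `numeric_only=True` and then fill with the resulting series.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'age': [63, 67, 37],
    'cholesterol': [233.0, np.nan, 250.0],
    'exer_angina': ['0', '1', None],   # column holding strings
})

medians = toy.median(numeric_only=True)   # skip non-numeric columns explicitly
filled = toy.fillna(medians)              # only columns present in 'medians' are filled

print(medians)
print(filled)
```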

In [8]:
# the 'fillna' function will take the given series (the output above)
# and fill in the missing values for the columns it has
df_imputed = df.fillna(df.median()) # note that to do this all values must be numeric
df_imputed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 920 entries, 0 to 919
Data columns (total 14 columns):
age                  920 non-null int64
is_male              920 non-null int64
chest_pain           920 non-null int64
rest_blood_press     920 non-null float64
cholesterol          920 non-null float64
high_blood_sugar     920 non-null object
rest_ecg             920 non-null float64
max_heart_rate       920 non-null float64
exer_angina          920 non-null object
ST_depression        920 non-null float64
Peak_ST_seg          920 non-null float64
major_vessels        920 non-null float64
thal                 920 non-null float64
has_heart_disease    920 non-null int64
dtypes: float64(8), int64(4), object(2)
memory usage: 100.7+ KB


Notice that the object variables are unchanged, but all the numeric/ordinal values have been filled in with the median of the columns. Let's try something (slightly) smarter, and fill in the ordinals with the median and the continuous variables with the mean.

In [20]:
# make one series for imputing with
series_mean = df[continuous_features].mean()
series_median = df[categ_features+ordinal_features].median()
cat_series = pd.concat((series_median,series_mean))

print (cat_series)

is_male                1.000000
high_blood_sugar       0.000000
exer_angina            0.000000
age                   54.000000
major_vessels          0.000000
chest_pain             4.000000
rest_ecg               0.000000
Peak_ST_seg            2.000000
thal                   6.000000
has_heart_disease      1.000000
rest_blood_press     132.132404
cholesterol          199.130337
max_heart_rate       137.545665
ST_depression          0.883178
dtype: float64


In [10]:
# now let's impute the numbers a bit differently

df_imputed = df.fillna(value=cat_series)
df_imputed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 920 entries, 0 to 919
Data columns (total 14 columns):
age                  920 non-null int64
is_male              920 non-null int64
chest_pain           920 non-null int64
rest_blood_press     920 non-null float64
cholesterol          920 non-null float64
high_blood_sugar     920 non-null object
rest_ecg             920 non-null float64
max_heart_rate       920 non-null float64
exer_angina          920 non-null object
ST_depression        920 non-null float64
Peak_ST_seg          920 non-null float64
major_vessels        920 non-null float64
thal                 920 non-null float64
has_heart_disease    920 non-null int64
dtypes: float64(8), int64(4), object(2)
memory usage: 100.7+ KB


In [19]:
df_imputed[categ_features].describe()

Unnamed: 0,is_male
count,920.0
mean,0.78913
std,0.408148
min,0.0
25%,1.0
50%,1.0
75%,1.0
max,1.0


In [27]:
# what if we want to impute values after grouping? Split-Impute-Combine
df_grouped = df.groupby(by=['is_male','high_blood_sugar'])

In [13]:
df_grouped

<pandas.core.groupby.DataFrameGroupBy object at 0x000001DC2C5D3B00>

In [28]:
# now use this grouping to fill the data set in each group, then transform back
# NOTE: as the output below shows, this call fails on the frame's data types
#       (the frame still holds object columns)
df_imputed = df_grouped.transform(lambda grp: grp.fillna(grp.mean()))
print(df_imputed.info())
print('------------')

# the above process can remove columns, so let's find them and add them back
names_removed = df.columns.difference(df_imputed.columns) # set difference of the two column indexes
print(names_removed)
print('------------')

df_imputed[names_removed] = df[names_removed]
print(df_imputed.info())



TypeError: Transform function invalid for data types
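
One way around the error above is to apply the transform only to numeric columns. The sketch below shows split-impute-combine on a small made-up frame. The column list `numeric_cols` and the toy values are assumptions for illustration, and this is not necessarily how the original notebook resolved the problem.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'is_male': [0, 0, 1, 1, 1],
    'cholesterol': [200.0, np.nan, 240.0, 260.0, np.nan],
    'label': ['a', 'b', 'c', 'd', 'e'],   # non-numeric column, left untouched
})

numeric_cols = ['cholesterol']            # columns to impute (chosen by hand here)

# split by group, fill each group's NaN with that group's mean, combine back
filled = toy.groupby('is_male')[numeric_cols].transform(
    lambda col: col.fillna(col.mean()))

toy_imputed = toy.copy()
toy_imputed[numeric_cols] = filled
print(toy_imputed)
```

Because `transform` returns a result aligned with the original index, the filled columns can be assigned straight back into a copy of the frame.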

### Indexing logically into Data Frames
Let's now say that we are only interested in the summary of the dataframe for patients without heart disease (`has_heart_disease == 0`). We can achieve this with one line of code:

In [24]:
df_imputed[df_imputed.has_heart_disease==0].describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
age,411.0,50.547445,9.4337,28.0,43.0,51.0,57.0,76.0
is_male,411.0,0.649635,0.477666,0.0,0.0,1.0,1.0,1.0
chest_pain,411.0,2.761557,0.903425,1.0,2.0,3.0,4.0,4.0
rest_blood_press,411.0,130.021042,16.460208,80.0,120.0,130.0,140.0,190.0
cholesterol,411.0,226.575368,74.301504,0.0,199.130337,225.0,266.0,564.0
rest_ecg,411.0,0.547445,0.805204,0.0,0.0,0.0,1.0,2.0
max_heart_rate,411.0,148.25283,23.152969,69.0,135.5,150.0,165.0,202.0
ST_depression,411.0,0.441963,0.704565,-1.1,0.0,0.0,0.883178,4.2
Peak_ST_seg,411.0,1.729927,0.515662,1.0,1.0,2.0,2.0,3.0
major_vessels,411.0,0.111922,0.427276,0.0,0.0,0.0,0.0,3.0
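
Boolean masks can also be combined. The sketch below, on a made-up frame, selects patients without heart disease, patients with any positive level, and a compound condition. Each condition needs its own parentheses because `&` binds more tightly than the comparison operators.

```python
import pandas as pd

toy = pd.DataFrame({
    'age': [63, 67, 37, 41],
    'is_male': [1, 1, 1, 0],
    'has_heart_disease': [0, 2, 0, 0],
})

healthy = toy[toy.has_heart_disease == 0]   # no heart disease
sick = toy[toy.has_heart_disease > 0]       # any positive level
healthy_older_men = toy[(toy.has_heart_disease == 0)
                        & (toy.is_male == 1)
                        & (toy.age > 50)]

print(len(healthy), len(sick), len(healthy_older_men))
```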


In [26]:
# or we can use the extremely useful "groupby" function
df_imputed.groupby(by='has_heart_disease').median()

Unnamed: 0_level_0,age,is_male,chest_pain,rest_blood_press,cholesterol,rest_ecg,max_heart_rate,ST_depression,Peak_ST_seg,major_vessels,thal
has_heart_disease,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
0,51,1,3,130.0,225.0,0.0,150.0,0.0,2.0,0.0,6.0
1,55,1,4,130.0,226.0,0.0,130.0,1.0,2.0,0.0,6.0
2,58,1,4,132.132404,193.0,0.0,130.0,1.4,2.0,0.0,6.0
3,60,1,4,132.132404,212.0,1.0,122.0,1.0,2.0,0.0,6.0
4,59,1,4,133.066202,218.5,1.0,126.5,2.45,2.0,0.0,6.0


In [25]:
df_imputed.groupby(['has_heart_disease']).median()

Unnamed: 0_level_0,age,is_male,chest_pain,rest_blood_press,cholesterol,rest_ecg,max_heart_rate,ST_depression,Peak_ST_seg,major_vessels,thal
has_heart_disease,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
0,51,1,3,130.0,225.0,0.0,150.0,0.0,2.0,0.0,6.0
1,55,1,4,130.0,226.0,0.0,130.0,1.0,2.0,0.0,6.0
2,58,1,4,132.132404,193.0,0.0,130.0,1.4,2.0,0.0,6.0
3,60,1,4,132.132404,212.0,1.0,122.0,1.0,2.0,0.0,6.0
4,59,1,4,133.066202,218.5,1.0,126.5,2.45,2.0,0.0,6.0


In [33]:
df_imputed.groupby(by=df_imputed.has_heart_disease>0).mean()

Unnamed: 0_level_0,age,chest_pain,rest_blood_press,cholesterol,rest_ecg,max_heart_rate,ST_depression,Peak_ST_seg,major_vessels,thal,has_heart_disease
has_heart_disease,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
False,50.547445,2.761557,130.075507,227.713509,0.547445,148.235344,0.445843,1.610641,0.491867,4.486785,0.0
True,55.903733,3.644401,134.100951,177.191037,0.650749,128.757144,1.251465,1.908871,0.86324,5.697933,1.799607


In [34]:
df_imputed.groupby(by=df_imputed.major_vessels>2).mean()

Unnamed: 0_level_0,age,chest_pain,rest_blood_press,cholesterol,rest_ecg,max_heart_rate,ST_depression,Peak_ST_seg,major_vessels,thal,has_heart_disease
major_vessels,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
False,53.368889,3.241111,132.198349,198.301656,0.587965,137.523459,0.867799,1.772801,0.629447,5.122661,0.966667
True,59.9,3.65,136.9,265.45,1.35,134.55,1.865,1.85,3.0,5.7,2.3
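
Grouping by a boolean expression, as in the two cells above, splits the rows into a `False` group and a `True` group. The sketch below does the same on a made-up frame and uses `agg` to compute several statistics at once. The key name `is_sick` is introduced only for this example.

```python
import pandas as pd

toy = pd.DataFrame({
    'has_heart_disease': [0, 2, 0, 1],
    'age': [50, 60, 40, 70],
})

is_sick = (toy.has_heart_disease > 0).rename('is_sick')  # boolean grouping key
summary = toy.groupby(is_sick)['age'].agg(['mean', 'count'])

print(summary)
```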


In [None]:
df_imputed

### One Hot Encoding of Categorical Variables

In [35]:
# one hot encoded variables can be created using the get_dummies function
tmpdf = pd.get_dummies(df_imputed['chest_pain'],prefix='chest')

tmpdf.head()

Unnamed: 0,chest_1,chest_2,chest_3,chest_4
0,1.0,0.0,0.0,0.0
1,0.0,0.0,0.0,1.0
2,0.0,0.0,0.0,1.0
3,0.0,0.0,1.0,0.0
4,0.0,1.0,0.0,0.0
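
Newer Pandas versions return boolean columns from `get_dummies` by default, so your output may show `True`/`False` instead of the `1.0`/`0.0` above. The sketch below, on a made-up series, passes `dtype=float` to reproduce the numeric display.

```python
import pandas as pd

chest_pain = pd.Series([1, 4, 4, 3, 2], name='chest_pain')

# dtype=float gives 0.0/1.0 indicator columns regardless of the default
dummies = pd.get_dummies(chest_pain, prefix='chest', dtype=float)

print(dummies)
```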


In [36]:
# one hot encoding of ALL categorical variables
# there is a lot going on in this one line of code, so let's step through it

# pd.concat([*], axis=1) // this concatenates all the data frames in the [*] list side by side
# [** for col in categ_features] // this steps through each feature in categ_features and 
#                                //   creates a new element in a list based on the output of **
# pd.get_dummies(df_imputed[col],prefix=col) // this creates a one hot encoded dataframe of the variable=col (like code above)

one_hot_df = pd.concat([pd.get_dummies(df_imputed[col],prefix=col) for col in categ_features], axis=1)

one_hot_df.head()

Unnamed: 0,is_male_0,is_male_1,high_blood_sugar_0,high_blood_sugar_1,exer_angina_0,exer_angina_1
0,0.0,1.0,0.0,1.0,1.0,0.0
1,0.0,1.0,1.0,0.0,0.0,1.0
2,0.0,1.0,1.0,0.0,0.0,1.0
3,0.0,1.0,1.0,0.0,1.0,0.0
4,1.0,0.0,1.0,0.0,1.0,0.0
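
As an alternative to the list comprehension, `get_dummies` accepts a whole frame plus a `columns` argument and encodes all the listed columns in one call, keeping the remaining columns as they are. This is a sketch on a made-up frame.

```python
import pandas as pd

toy = pd.DataFrame({
    'age': [63, 41],
    'is_male': [1, 0],
    'exer_angina': [0, 1],
})

# encode the listed columns; 'age' is passed through unchanged
encoded = pd.get_dummies(toy, columns=['is_male', 'exer_angina'], dtype=float)

print(encoded.columns.tolist())
```

The encoded columns replace the originals, so no separate concatenation step is needed.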
