# Pandas Tips & Review
### As a data scientist, you will use the pandas library in many ways to solve some of the world's most pressing issues

  
### Here are some useful pandas methods for manipulating DataFrames so they do and show whatever we want!

![Alt Text](https://media.giphy.com/media/FYnYcQoFmlPeU/giphy.gif)

In [1]:
import pandas as pd
import numpy as np

#### Let's use the merge function from pandas to combine two dataframes together

In [20]:
df1 = pd.read_csv('heart.csv')
df1.tail()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0
302,57,0,1,130,236,0,0,174,0,0.0,1,1,2,0


In [30]:
df2 = pd.DataFrame(np.random.randint(0,100,size=(303,3)), columns=['B','C','D'])

In [31]:
df3 = pd.merge(df1,df2, on=df1.index)
df3

Unnamed: 0,key_0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target,B,C,D
0,0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1,52,67,97
1,1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1,95,58,72
2,2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1,89,52,87
3,3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1,47,51,80
4,4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1,72,83,29
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0,23,5,34
299,299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0,83,7,43
300,300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0,19,83,6
301,301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0,81,52,34


Answer Below

In [None]:
df3 = pd.merge(df1,df2, on=df1.index)

In [None]:
df3
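
The merge above passes `on=df1.index`, which works but leaves an extra `key_0` column in the result. As a minimal sketch of one alternative, using two small made-up frames (`left` and `right` are illustrative names, not part of heart.csv), `left_index=True` and `right_index=True` tell merge to match rows on the index itself:

```python
import pandas as pd

# Two small made-up frames that share the same default index (0, 1, 2)
left = pd.DataFrame({'age': [63, 37, 41], 'chol': [233, 250, 204]})
right = pd.DataFrame({'B': [1, 2, 3], 'C': [4, 5, 6]})

# Match rows on the index of both frames instead of on a column
merged = pd.merge(left, right, left_index=True, right_index=True)

print(merged.columns.tolist())
```

The result keeps the original index and contains only the columns from the two inputs.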

#### What if we simply wanted to add a few relevant columns of data to our dataframe? (Hint: use pd.concat)

In [32]:
lab_df = pd.DataFrame(np.random.randint(0,100,size=(303,3)), columns=['Chemical_A','Chemical_B','Chemical_C'])

In [44]:
concat_df = pd.concat([df2,lab_df], axis=1)
concat_df

Unnamed: 0,B,C,D,Chemical_A,Chemical_B,Chemical_C
0,52,67,97,80,29,34
1,95,58,72,41,44,22
2,89,52,87,81,65,69
3,47,51,80,37,76,50
4,72,83,29,28,78,11
...,...,...,...,...,...,...
298,23,5,34,10,80,35
299,83,7,43,26,23,59
300,19,83,6,91,25,66
301,81,52,34,99,67,29


Answer Below

In [None]:
concated_df = pd.concat([df2,lab_df],axis=1)

In [None]:
concated_df
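
One detail worth knowing: with `axis=1`, concat lines rows up by index label, not by position. A small sketch with made-up frames (`a` and `b` are hypothetical) shows what happens when the two indexes only partly overlap:

```python
import pandas as pd

# Made-up frames: a has labels 0-2, b has labels 1-3
a = pd.DataFrame({'B': [1, 2, 3]})
b = pd.DataFrame({'Chemical_A': [10, 20, 30]}, index=[1, 2, 3])

# axis=1 places the columns side by side, aligned on index labels;
# labels present in only one frame get NaN in the other frame's columns
side_by_side = pd.concat([a, b], axis=1)

print(side_by_side)
```

In the notebook above both frames have 303 rows with the same default index, so every row lines up and no missing values appear.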

#### So what exactly does the join method do? Use the join method to perform a 'left join'

In [56]:
joined_df = df1.join(df2, on=df1.index, how='left')
joined_df

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target,B,C,D
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1,52,67,97
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1,95,58,72
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1,89,52,87
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1,47,51,80
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1,72,83,29
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0,23,5,34
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0,83,7,43
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0,19,83,6
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0,81,52,34


Answer Below

In [None]:
joined_df = df1.join(df2, on=df1.index,how='left')

In [None]:
joined_df
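
By default, `join` matches the caller's index against the other frame's index, and `how='left'` keeps every row of the caller. A minimal sketch with made-up data (the frames `patients` and `labs` are hypothetical), where one patient has no lab row:

```python
import pandas as pd

# Made-up data: three patients, but lab values for only the first two
patients = pd.DataFrame({'age': [63, 37, 41]})
labs = pd.DataFrame({'B': [52, 95]}, index=[0, 1])

# Left join: every row of patients is kept; the missing lab value becomes NaN
left_joined = patients.join(labs, how='left')

# Inner join, for contrast: only index labels present in both frames survive
inner_joined = patients.join(labs, how='inner')

print(left_joined)
print(inner_joined)
```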

## Now let's talk about .loc vs .iloc 

The Pandas loc indexer can be used with DataFrames for two different use cases: 

a.) Selecting rows by label/index  
b.) Selecting rows with a boolean / conditional lookup

a.) Selecting rows by label/index

In [59]:
# Select rows with index labels 0 through 100 (.loc slices include both endpoints), with all columns from 'age' through 'chol'
joined_df.loc[0:100, 'age':'chol']

Unnamed: 0,age,sex,cp,trestbps,chol
0,63,1,3,145,233
1,37,1,2,130,250
2,41,0,1,130,204
3,56,1,1,120,236
4,57,0,0,120,354
...,...,...,...,...,...
96,62,0,0,140,394
97,52,1,0,108,233
98,43,1,2,130,315
99,53,1,2,130,246


>Notice below that .loc[300] does not grab the row at position 300, but rather the row whose index label is 300

In [61]:
joined_df.loc[300]

age          68.0
sex           1.0
cp            0.0
trestbps    144.0
chol        193.0
fbs           1.0
restecg       1.0
thalach     141.0
exang         0.0
oldpeak       3.4
slope         1.0
ca            2.0
thal          3.0
target        0.0
B            19.0
C            83.0
D             6.0
Name: 300, dtype: float64

b.) Selecting rows with a boolean / conditional lookup

In [63]:
# Select rows with index labels greater than 100 and up to 200, and return just the 'age' and 'cp' columns
joined_df.loc[(joined_df.index > 100) & (joined_df.index <= 200), ['age', 'cp']]

Unnamed: 0,age,cp
101,59,3
102,63,1
103,42,2
104,50,2
105,68,2
...,...,...
196,46,2
197,67,0
198,62,0
199,65,0
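
The same boolean pattern works on column values, not just on the index. A short sketch on a made-up frame (`toy` is a hypothetical stand-in for the heart data) selects rows where age is over 50 and sex equals 1:

```python
import pandas as pd

# Made-up stand-in for a few rows of the heart data
toy = pd.DataFrame({
    'age': [63, 37, 41, 56],
    'sex': [1, 1, 0, 1],
    'cp': [3, 2, 1, 1],
})

# Each condition needs its own parentheses; & combines them element-wise
mask = (toy['age'] > 50) & (toy['sex'] == 1)

# Rows where the mask is True, restricted to two columns
subset = toy.loc[mask, ['age', 'cp']]

print(subset)
```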


The pandas iloc indexer is used for integer-location based indexing, that is, selection by position rather than by label.

##### Use .iloc to grab specific rows

First 4 rows

In [64]:
joined_df.iloc[0:4]

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target,B,C,D
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1,52,67,97
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1,95,58,72
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1,89,52,87
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1,47,51,80


In [66]:
joined_df.iloc[5:18]

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target,B,C,D
5,57,1,0,140,192,0,1,148,0,0.4,1,0,1,1,68,24,65
6,56,0,1,140,294,0,0,153,0,1.3,1,0,2,1,69,63,46
7,44,1,1,120,263,0,1,173,0,0.0,2,0,3,1,43,19,65
8,52,1,2,172,199,1,1,162,0,0.5,2,0,3,1,86,60,32
9,57,1,2,150,168,0,1,174,0,1.6,2,0,2,1,49,42,70
10,54,1,0,140,239,0,1,160,0,1.2,2,0,2,1,33,85,41
11,48,0,2,130,275,0,1,139,0,0.2,2,0,2,1,73,84,76
12,49,1,1,130,266,0,1,171,0,0.6,2,0,2,1,99,13,38
13,64,1,3,110,211,0,0,144,1,1.8,1,0,2,1,58,41,95
14,58,0,3,150,283,1,0,162,0,1.0,2,0,2,1,52,55,63


First 3 columns

In [65]:
joined_df.iloc[:,0:3]

Unnamed: 0,age,sex,cp
0,63,1,3
1,37,1,2
2,41,0,1
3,56,1,1
4,57,0,0
...,...,...,...
298,57,0,0
299,45,1,3
300,68,1,0
301,57,1,0


Or grab rows (0,3,6,24) from columns (0,5,6)

In [67]:
joined_df.iloc[[0,3,6,24], [0,5,6]] 

Unnamed: 0,age,fbs,restecg
0,63,1,0
3,56,0,1
6,56,0,0
24,40,0,1
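
In the heart data the index labels happen to equal the row positions, so `.loc` and `.iloc` can look interchangeable. A sketch on a made-up frame with labels 10, 20, 30, 40 (hypothetical data) makes the difference visible:

```python
import pandas as pd

# Made-up frame whose index labels differ from the row positions
toy = pd.DataFrame({'age': [63, 37, 41, 56]}, index=[10, 20, 30, 40])

# .loc slices by label and includes both endpoints: labels 10, 20, 30
by_label = toy.loc[10:30]

# .iloc slices by position and excludes the stop: positions 0 and 1
by_position = toy.iloc[0:2]

# The same cell reached both ways: label 20 sits at position 1
same_cell = toy.loc[20, 'age'] == toy.iloc[1, 0]

print(len(by_label), len(by_position), same_cell)
```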


# Ways to utilize lambda functions
#### .map(), .apply(), .applymap()

The map() method works on a pandas Series

The apply() method works on both pandas Series and DataFrames

The applymap() method works on an entire pandas DataFrame, applying the input function to every element individually. In other words, applymap() is apply() + map()! (Newer pandas releases, 2.1 and later, deprecate applymap() in favor of an element-wise DataFrame.map().)

A Series is another name for a column. If you select a single column, you're asking for a Series
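
To see the three methods side by side, here is a minimal sketch on a made-up two-column frame (`toy` is hypothetical). It assumes `DataFrame.map` is the element-wise replacement for `applymap` in pandas 2.1 and later, and falls back to `applymap` on older versions:

```python
import pandas as pd

toy = pd.DataFrame({'age': [63, 37], 'chol': [233, 250]})

# Series.map: the lambda sees one value of the column at a time
age_times_ten = toy['age'].map(lambda x: x * 10)

# DataFrame.apply: the lambda sees a whole column (or a whole row with axis=1)
column_max = toy.apply(lambda col: col.max())
row_total = toy.apply(lambda row: row.sum(), axis=1)

# Element-wise over every cell of the frame; pick whichever method this
# pandas version provides
elementwise = toy.map if hasattr(pd.DataFrame, 'map') else toy.applymap
all_times_ten = elementwise(lambda x: x * 10)

print(age_times_ten.tolist(), column_max.tolist(), row_total.tolist())
print(all_times_ten)
```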

In [69]:
df = pd.read_csv('heart.csv')
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [74]:
new_df = df.age.map(lambda x: x * 10)
type(new_df)

pandas.core.series.Series

In [76]:
super_df = df.age.apply(lambda x: x * 10)
type(super_df)
super_df

0      630
1      370
2      410
3      560
4      570
      ... 
298    570
299    450
300    680
301    570
302    570
Name: age, Length: 303, dtype: int64

In [75]:
# applymap() is a DataFrame method, so calling it on a Series (df.age) raises an error
ultra_df = df.age.applymap(lambda x: x * 10)

AttributeError: 'Series' object has no attribute 'applymap'

#### How else could we manipulate the 'age' column?

In [78]:
my_df = df.age.mean()
my_df

54.366336633663366

In [81]:
my_new_df = df.age.map(lambda x: x+5)
my_new_df.head()

0    68
1    42
2    46
3    61
4    62
Name: age, dtype: int64

Do the two Series from earlier (new_df from map() and super_df from apply()) hold the same values?

In [82]:
new_ultra_df = df.applymap(lambda x: x*10)
new_ultra_df

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,630,10,30,1450,2330,10,0,1500,0,23.0,0,0,10,10
1,370,10,20,1300,2500,0,10,1870,0,35.0,0,0,20,10
2,410,0,10,1300,2040,0,0,1720,0,14.0,20,0,20,10
3,560,10,10,1200,2360,0,10,1780,0,8.0,20,0,20,10
4,570,0,0,1200,3540,0,10,1630,10,6.0,20,0,20,10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,570,0,0,1400,2410,0,10,1230,10,2.0,10,0,30,0
299,450,10,30,1100,2640,0,10,1320,0,12.0,10,0,30,0
300,680,10,0,1440,1930,10,10,1410,0,34.0,10,20,30,0
301,570,10,0,1300,1310,0,10,1150,10,12.0,10,10,30,0


In [83]:
(new_df == super_df).value_counts()

True    303
Name: age, dtype: int64
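
Counting the element-wise comparison works, and `Series.equals` gives the same answer as a single boolean. A sketch with made-up ages (the variable names are illustrative) also shows that plain vectorized arithmetic yields the same values without a lambda:

```python
import pandas as pd

ages = pd.Series([63, 37, 41], name='age')

via_map = ages.map(lambda x: x * 10)
via_apply = ages.apply(lambda x: x * 10)

# Vectorized arithmetic: no Python-level function call per element
vectorized = ages * 10

# equals() compares values, index and dtype in one call
same_map_apply = via_map.equals(via_apply)
same_vectorized = via_map.equals(vectorized)

print(same_map_apply, same_vectorized)
```

For simple arithmetic like this, the vectorized form is usually the more idiomatic choice; map() and apply() earn their keep when the function cannot be expressed as a column operation.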

In [84]:
apply_df = df.applymap(lambda x: x*10)

In [85]:
apply_df

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,630,10,30,1450,2330,10,0,1500,0,23.0,0,0,10,10
1,370,10,20,1300,2500,0,10,1870,0,35.0,0,0,20,10
2,410,0,10,1300,2040,0,0,1720,0,14.0,20,0,20,10
3,560,10,10,1200,2360,0,10,1780,0,8.0,20,0,20,10
4,570,0,0,1200,3540,0,10,1630,10,6.0,20,0,20,10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,570,0,0,1400,2410,0,10,1230,10,2.0,10,0,30,0
299,450,10,30,1100,2640,0,10,1320,0,12.0,10,0,30,0
300,680,10,0,1440,1930,10,10,1410,0,34.0,10,20,30,0
301,570,10,0,1300,1310,0,10,1150,10,12.0,10,10,30,0


In [115]:
!pip install faker



In [117]:
from faker import Faker
fake = Faker()
fake.name()

'Amy Oconnell'

Let's generate a column of names

In [125]:
df['name'] = [fake.name() for x in range(303)]
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target,name
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1,Courtney Reese
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1,Courtney Nguyen
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1,Karen Benson
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1,Katherine Lloyd
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1,Katrina Hatfield


In [126]:
df.name = df.name.map(lambda x: x.replace(" ",'_'))
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target,name
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1,Courtney_Reese
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1,Courtney_Nguyen
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1,Karen_Benson
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1,Katherine_Lloyd
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1,Katrina_Hatfield
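
The lambda version above works. For plain string operations, pandas also provides vectorized `.str` methods. A sketch with hard-coded names standing in for the Faker output (the Series `names` is hypothetical):

```python
import pandas as pd

# Hard-coded names standing in for the random Faker output
names = pd.Series(['Courtney Reese', 'Karen Benson'], name='name')

# The approach used above: a Python lambda applied to each value
with_lambda = names.map(lambda x: x.replace(' ', '_'))

# Vectorized alternative; regex=False treats ' ' as a literal string
with_str = names.str.replace(' ', '_', regex=False)

print(with_lambda.tolist())
print(with_str.tolist())
```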


## BONUS:
##### Can you grab the row with info on Brent Jones?

In [138]:
courtney_reese = df.loc[df.name=='Courtney_Reese']
brent_jones = df.loc[df.name=='Brent_Jones']
courtney_reese  # returns the matching row
brent_jones  # returns an empty DataFrame because there is no Brent Jones in this data

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target,name
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1,Courtney_Reese


Answer Below

In [None]:
brent_jones = df.loc[df['name'] == 'Brent_Jones']
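
Because Faker produces random names, a lookup like this may match nothing at all. A short sketch on a made-up frame (`people` is hypothetical) shows how to tell an empty result from a match using the `.empty` attribute:

```python
import pandas as pd

# Made-up frame standing in for the heart data with generated names
people = pd.DataFrame({
    'name': ['Courtney_Reese', 'Karen_Benson'],
    'age': [63, 41],
})

found = people.loc[people['name'] == 'Courtney_Reese']
missing = people.loc[people['name'] == 'Brent_Jones']

# .empty is True when the selection has no rows
print(found.empty, missing.empty)
```

A filter that matches nothing does not raise an error; it returns a DataFrame with the same columns and zero rows.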