# Pandas Tips & Review
### As a data scientist, you will be utilizing the pandas library in various ways to solve some of the world's most pressing issues

  
### Here are some useful pandas methods for reshaping dataframes and displaying exactly what we want to see!

![Alt Text](https://media.giphy.com/media/aUhEBE0T8XNHa/giphy.gif)

In [1]:
import pandas as pd
import numpy as np

## Pandas Display Settings

In [4]:
absent = pd.read_csv('Absenteeism_at_work.csv', delimiter=';')
absent

Unnamed: 0,ID,Reason for absence,Month of absence,Day of the week,Seasons,Transportation expense,Distance from Residence to Work,Service time,Age,Work load Average/day,...,Disciplinary failure,Education,Son,Social drinker,Social smoker,Pet,Weight,Height,Body mass index,Absenteeism time in hours
0,11,26,7,3,1,289,36,13,33,239.554,...,0,1,2,1,0,1,90,172,30,4
1,36,0,7,3,1,118,13,18,50,239.554,...,1,1,1,1,0,0,98,178,31,0
2,3,23,7,4,1,179,51,18,38,239.554,...,0,1,0,1,0,0,89,170,31,2
3,7,7,7,5,1,279,5,14,39,239.554,...,0,1,2,1,1,0,68,168,24,4
4,11,23,7,5,1,289,36,13,33,239.554,...,0,1,2,1,0,1,90,172,30,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
735,11,14,7,3,1,289,36,13,33,264.604,...,0,1,2,1,0,1,90,172,30,8
736,1,11,7,3,1,235,11,14,37,264.604,...,0,3,1,0,0,1,88,172,29,4
737,4,0,0,3,1,118,14,13,40,271.219,...,0,1,1,1,0,8,98,170,34,0
738,8,0,0,4,2,231,35,14,39,271.219,...,0,1,2,1,0,2,100,170,35,0


Notice how the columns and rows are truncated?  Let's work on fixing this using pandas display settings!
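How `max_rows` and `min_rows` interact is easy to misread, so here is a minimal sketch on a made-up 50-row dataframe (the `demo` frame and its `value` column are assumptions for illustration, not part of the absenteeism data). It uses `pd.option_context`, which applies the settings only inside the `with` block and restores the previous values afterwards.

```python
import pandas as pd

# Hypothetical 50-row frame, used only to illustrate the display options.
demo = pd.DataFrame({"value": range(50)})

# 50 rows is above max_rows=10, so the repr is truncated and min_rows=4
# decides how many rows are printed (2 from the top, 2 from the bottom).
with pd.option_context("display.max_columns", 20,
                       "display.max_rows", 10,
                       "display.min_rows", 4):
    truncated = repr(demo)

# 50 rows is below max_rows=60, so every row is printed.
with pd.option_context("display.max_columns", 20,
                       "display.max_rows", 60,
                       "display.min_rows", 4):
    full = repr(demo)

print(truncated)
print(len(full.splitlines()))
```

In other words, `max_rows` is the threshold that triggers truncation, and `min_rows` is the number of rows you see once truncation happens.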

In [11]:
#viewing our current settings
print(pd.options.display.max_columns)
print(pd.options.display.max_rows)

50
None


In [27]:
#adjusting settings
pd.options.display.max_columns = 20
pd.options.display.max_rows = 760
pd.options.display.min_rows=2

#Note: max_rows is only the threshold. Once a dataframe has more rows than max_rows, the display is
#truncated, and min_rows decides how many rows are actually shown.

In [28]:
#reviewing our settings
print(pd.options.display.max_columns)
print(pd.options.display.max_rows)

20
760


In [29]:
#checking our dataframe again to see if our settings worked
absent

Unnamed: 0,ID,Reason for absence,Month of absence,Day of the week,Seasons,Transportation expense,Distance from Residence to Work,Service time,Age,Work load Average/day,...,Disciplinary failure,Education,Son,Social drinker,Social smoker,Pet,Weight,Height,Body mass index,Absenteeism time in hours
0,11,26,7,3,1,289,36,13,33,239.554,...,0,1,2,1,0,1,90,172,30,4
1,36,0,7,3,1,118,13,18,50,239.554,...,1,1,1,1,0,0,98,178,31,0
2,3,23,7,4,1,179,51,18,38,239.554,...,0,1,0,1,0,0,89,170,31,2
3,7,7,7,5,1,279,5,14,39,239.554,...,0,1,2,1,1,0,68,168,24,4
4,11,23,7,5,1,289,36,13,33,239.554,...,0,1,2,1,0,1,90,172,30,2
5,3,23,7,6,1,179,51,18,38,239.554,...,0,1,0,1,0,0,89,170,31,2
6,10,22,7,6,1,361,52,3,28,239.554,...,0,1,1,1,0,4,80,172,27,8
7,20,23,7,6,1,260,50,11,36,239.554,...,0,1,4,1,0,0,65,168,23,4
8,14,19,7,2,1,155,12,14,34,239.554,...,0,1,2,1,0,0,95,196,25,40
9,1,22,7,2,1,235,11,14,37,239.554,...,0,3,1,0,0,1,88,172,29,8


## Working with multiple dataframes

#### One way to combine two datasets is to use the `.merge()` method from pandas

Let's take a look at the `.merge()` documentation in pandas before we begin: [pandas.DataFrame.merge](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html)

In [2]:
df1 = pd.read_csv('heart.csv')
df1.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,52,1,0,125,212,0,1,168,0,1.0,2,2,3,0
1,53,1,0,140,203,1,0,155,1,3.1,0,0,3,0
2,70,1,0,145,174,0,1,125,1,2.6,0,0,3,0
3,61,1,0,148,203,0,1,161,0,0.0,2,1,3,0
4,62,0,0,138,294,1,1,106,0,1.9,1,3,2,0


In [3]:
df2 = pd.DataFrame(np.random.randint(0,100,size=(1025,3)), columns=['B','C','D'])
df2.head()

Unnamed: 0,B,C,D
0,48,58,67
1,54,38,98
2,39,72,60
3,37,8,76
4,34,71,85


In [4]:
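#merge df1 and df2 on their row index
pd.options.display.min_rows=10
df1.merge(df2,left_index=True,right_index=True)

#pd.merge() is the function form of the same operation; we keep the result for later
merged_df=pd.merge(df1,df2,left_index=True,right_index=True)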
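The merge above lines rows up by index. Merging on a shared key column is just as common, so here is a small sketch with two made-up frames (`patients`, `labs`, and the `patient_id` column are hypothetical names, not part of `heart.csv`). It shows how the `how` argument changes which rows survive, and how `indicator=True` reports where each row came from.

```python
import pandas as pd

# Hypothetical frames that share a key column but not all of its values.
patients = pd.DataFrame({"patient_id": [1, 2, 3], "age": [52, 53, 70]})
labs = pd.DataFrame({"patient_id": [2, 3, 4], "chol": [203, 174, 294]})

# Inner merge: keep only the keys present in both frames.
inner = patients.merge(labs, on="patient_id", how="inner")

# Left merge: keep every row of `patients`; unmatched rows get NaN for `chol`.
# indicator=True adds a `_merge` column describing the match.
left = patients.merge(labs, on="patient_id", how="left", indicator=True)

print(inner)
print(left)
```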

####  Another method to join two dataframes is to use the `.join` method

[pandas.DataFrame.join](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html)

![](img/join-types.jpg)

In [5]:
#join df1 and df2
joined_df=df1.join(df2) #.join() aligns on the index by default, so no `on` argument is needed
joined_df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target,B,C,D
0,52,1,0,125,212,0,1,168,0,1.0,2,2,3,0,48,58,67
1,53,1,0,140,203,1,0,155,1,3.1,0,0,3,0,54,38,98
2,70,1,0,145,174,0,1,125,1,2.6,0,0,3,0,39,72,60
3,61,1,0,148,203,0,1,161,0,0.0,2,1,3,0,37,8,76
4,62,0,0,138,294,1,1,106,0,1.9,1,3,2,0,34,71,85
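Because `df1` and `df2` share exactly the same index, the join type makes no visible difference above. The sketch below uses two tiny made-up frames with partly overlapping indexes (the names `left` and `right` are assumptions for illustration) to show what the `how` argument of `.join()` does.

```python
import pandas as pd

# Hypothetical frames whose indexes only partly overlap.
left = pd.DataFrame({"age": [52, 53, 70]}, index=[0, 1, 2])
right = pd.DataFrame({"B": [48, 54, 39]}, index=[1, 2, 3])

# Default is how="left": keep every row of `left`, fill gaps with NaN.
default_join = left.join(right)

# Inner join: only index labels found in both frames.
inner_join = left.join(right, how="inner")

# Outer join: every index label found in either frame.
outer_join = left.join(right, how="outer")

print(default_join)
print(inner_join)
print(outer_join)
```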


#### We can also combine two dataframes using the `pd.concat()` function, which can add either new rows or new columns to our dataframe.

[pandas.concat documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html)

In [6]:
# creating a new dataframe with columns
df4 = pd.DataFrame(np.random.randint(0,100,size=(1025,3)), columns=['Chemical_A','Chemical_B','Chemical_C'])
df4.head()

Unnamed: 0,Chemical_A,Chemical_B,Chemical_C
0,48,77,26
1,80,50,53
2,6,32,67
3,78,44,4
4,59,88,45


In [7]:
#adding new columns to dataframe using concat
df5=pd.concat([merged_df,df4],axis=1)

df5

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target,B,C,D,Chemical_A,Chemical_B,Chemical_C
0,52,1,0,125,212,0,1,168,0,1.0,2,2,3,0,48,58,67,48,77,26
1,53,1,0,140,203,1,0,155,1,3.1,0,0,3,0,54,38,98,80,50,53
2,70,1,0,145,174,0,1,125,1,2.6,0,0,3,0,39,72,60,6,32,67
3,61,1,0,148,203,0,1,161,0,0.0,2,1,3,0,37,8,76,78,44,4
4,62,0,0,138,294,1,1,106,0,1.9,1,3,2,0,34,71,85,59,88,45
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1020,59,1,1,140,221,0,1,164,1,0.0,2,0,2,1,53,11,69,22,98,91
1021,60,1,0,125,258,0,0,141,1,2.8,1,1,3,0,86,83,36,24,39,23
1022,47,1,0,110,275,0,0,118,1,1.0,1,1,2,0,83,31,66,69,55,53
1023,50,0,0,110,254,0,0,159,0,0.0,2,0,2,1,12,7,96,50,5,88


In [8]:
#creating two new rows (note: no 'cp' value is supplied here)

df6 = pd.DataFrame({'age':[54, 72], 'sex':[1,0], 'trestbps':[150, 143], 'chol':[276, 212],
                    'fbs':[0, 1], 'restecg': [1, 0], 'thalach':[146, 171], 'exang':[0, 0],
                    'oldpeak': [0.5, 2.9], 'slope':[1, 2], "ca" :[2, 0], "thal":[2, 2],
                    "target": [1, 0], 'B':[25, 58], "C":[23, 84], "D":[40, 57], "Chemical_A":[3, 84],
                    "Chemical_B":[43, 71], "Chemical_C": [45, 92]})


In [9]:
conda update pandas

In [10]:
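#concat with axis=1 places df6 beside df5 as extra columns, aligned on the row index,
#so only rows 0 and 1 get values and the rest are NaN; to append df6 as new rows, use axis=0
pd.concat([df5,df6],axis=1)
#pandas may ask for sort=True or sort=False here: sort controls whether the non-concatenation axis
#is sorted when the inputs' labels are not already aligned; sort=False keeps the existing order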

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,...,slope,ca,thal,target,B,C,D,Chemical_A,Chemical_B,Chemical_C
0,52,1,0,125,212,0,1,168,0,1.0,...,1.0,2.0,2.0,1.0,25.0,23.0,40.0,3.0,43.0,45.0
1,53,1,0,140,203,1,0,155,1,3.1,...,2.0,0.0,2.0,0.0,58.0,84.0,57.0,84.0,71.0,92.0
2,70,1,0,145,174,0,1,125,1,2.6,...,,,,,,,,,,
3,61,1,0,148,203,0,1,161,0,0.0,...,,,,,,,,,,
4,62,0,0,138,294,1,1,106,0,1.9,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1020,59,1,1,140,221,0,1,164,1,0.0,...,,,,,,,,,,
1021,60,1,0,125,258,0,0,141,1,2.8,...,,,,,,,,,,
1022,47,1,0,110,275,0,0,118,1,1.0,...,,,,,,,,,,
1023,50,0,0,110,254,0,0,159,0,0.0,...,,,,,,,,,,
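To actually append rows, the concatenation has to run along `axis=0`. Here is a minimal sketch with small made-up frames (`base`, `new_rows`, and `chol_typo` are hypothetical names) that contrasts the two axes and shows why column names must match exactly when stacking rows.

```python
import pandas as pd

# Hypothetical stand-ins for the large frame and the rows to append.
base = pd.DataFrame({"age": [52, 53], "chol": [212, 203]})
new_rows = pd.DataFrame({"age": [54, 72], "chol": [276, 212]})

# axis=0 stacks rows; ignore_index=True renumbers the index 0..n-1.
stacked = pd.concat([base, new_rows], axis=0, ignore_index=True)

# axis=1 places the frames side by side, aligned on the row index.
side_by_side = pd.concat([base, new_rows], axis=1)

# A misspelled column name does not raise an error: it silently becomes
# a new column, and the intended column is filled with NaN.
misspelled = pd.DataFrame({"age": [54], "chol_typo": [276]})
mismatched = pd.concat([base, misspelled], ignore_index=True)

print(stacked.shape, side_by_side.shape)
print(mismatched)
```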


# Ways to utilize lambda functions
#### .map(), .apply(), .applymap()

The `map()` method works only on a pandas Series.

The `apply()` method works on both pandas Series and dataframes.

The `applymap()` method works on an entire pandas dataframe, applying the input function to every element individually. In other words, `applymap()` is `apply()` + `map()`!
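The difference between the three is mostly about what the lambda receives. The sketch below, on a tiny made-up dataframe, shows a Series `map()` (one value at a time), a dataframe `apply()` (one whole column or row at a time), and the element-wise dataframe method. One assumption worth flagging: pandas 2.1 and later deprecate `applymap()` in favour of `DataFrame.map()`, so the sketch picks whichever one the installed version provides.

```python
import pandas as pd

# Hypothetical miniature of the heart data.
df = pd.DataFrame({"age": [52, 53, 70], "chol": [212, 203, 174]})

# Series.map: the lambda receives one value at a time.
age_x10 = df["age"].map(lambda x: x * 10)

# DataFrame.apply: the lambda receives a whole column (default, axis=0)...
col_range = df.apply(lambda col: col.max() - col.min())

# ...or a whole row when axis=1.
row_sum = df.apply(lambda row: row["age"] + row["chol"], axis=1)

# Element-wise over the whole frame: `map` on pandas 2.1+, `applymap` before that.
elementwise = df.map if hasattr(pd.DataFrame, "map") else df.applymap
scaled = elementwise(lambda x: x * 10)

print(age_x10.tolist())
print(col_range.to_dict())
print(row_sum.tolist())
print(scaled)
```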

In [38]:
df = pd.read_csv('heart.csv')
df.head()


Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,52,1,0,125,212,0,1,168,0,1.0,2,2,3,0
1,53,1,0,140,203,1,0,155,1,3.1,0,0,3,0
2,70,1,0,145,174,0,1,125,1,2.6,0,0,3,0
3,61,1,0,148,203,0,1,161,0,0.0,2,1,3,0
4,62,0,0,138,294,1,1,106,0,1.9,1,3,2,0


In [39]:
new_df = df.age.map(lambda x: x * 10)
new_df.head()

0    520
1    530
2    700
3    610
4    620
Name: age, dtype: int64

In [40]:
super_df = df.age.apply(lambda x: x * 10)
super_df.head()

0    520
1    530
2    700
3    610
4    620
Name: age, dtype: int64

Do these two Series contain the same values?

In [41]:
(new_df == super_df).value_counts()

True    1025
Name: age, dtype: int64
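Counting `True` values works, but there are shorter ways to ask the same question. This is a minimal sketch on made-up Series (the variable names are assumptions); it also shows one caveat of `==`, namely that missing values never compare equal to each other.

```python
import pandas as pd

# Hypothetical Series standing in for the map() and apply() results.
a = pd.Series([520, 530, 700], name="age")
b = pd.Series([520, 530, 700], name="age")

# Element-wise comparison collapsed to a single True/False.
all_match = bool((a == b).all())

# Series.equals compares values and dtype in one call.
same = a.equals(b)

# Caveat: NaN == NaN is False element-wise, but equals() treats
# NaNs in the same position as equal.
with_nan = pd.Series([1.0, float("nan")])
nan_elementwise = bool((with_nan == with_nan).all())
nan_equals = with_nan.equals(with_nan)

print(all_match, same, nan_elementwise, nan_equals)
```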

In [42]:
df

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,52,1,0,125,212,0,1,168,0,1.0,2,2,3,0
1,53,1,0,140,203,1,0,155,1,3.1,0,0,3,0
2,70,1,0,145,174,0,1,125,1,2.6,0,0,3,0
3,61,1,0,148,203,0,1,161,0,0.0,2,1,3,0
4,62,0,0,138,294,1,1,106,0,1.9,1,3,2,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1020,59,1,1,140,221,0,1,164,1,0.0,2,0,2,1
1021,60,1,0,125,258,0,0,141,1,2.8,1,1,3,0
1022,47,1,0,110,275,0,0,118,1,1.0,1,1,2,0
1023,50,0,0,110,254,0,0,159,0,0.0,2,0,2,1


In [58]:
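#note: this cell was run after the 'name' column was added further down, which is why names
#appear in the output; multiplying a string by 10 repeats it 10 times
apply_df = df.applymap(lambda x: x*10)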

In [59]:
apply_df

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target,name
0,520,10,0,1250,2120,0,10,1680,0,10.0,20,20,30,0,Megan_BryantMegan_BryantMegan_BryantMegan_Brya...
1,530,10,0,1400,2030,10,0,1550,10,31.0,0,0,30,0,Justin_HuntJustin_HuntJustin_HuntJustin_HuntJu...
2,700,10,0,1450,1740,0,10,1250,10,26.0,0,0,30,0,John_MorseJohn_MorseJohn_MorseJohn_MorseJohn_M...
3,610,10,0,1480,2030,0,10,1610,0,0.0,20,10,30,0,Isaac_GoodmanIsaac_GoodmanIsaac_GoodmanIsaac_G...
4,620,0,0,1380,2940,10,10,1060,0,19.0,10,30,20,0,Joshua_HallJoshua_HallJoshua_HallJoshua_HallJo...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1020,590,10,10,1400,2210,0,10,1640,10,0.0,20,0,20,10,Tanya_WilliamsTanya_WilliamsTanya_WilliamsTany...
1021,600,10,0,1250,2580,0,0,1410,10,28.0,10,10,30,0,Angela_JamesAngela_JamesAngela_JamesAngela_Jam...
1022,470,10,0,1100,2750,0,0,1180,10,10.0,10,10,20,0,James_HillJames_HillJames_HillJames_HillJames_...
1023,500,0,0,1100,2540,0,0,1590,0,0.0,20,0,20,10,Brandon_FreemanBrandon_FreemanBrandon_FreemanB...


#### How else could we manipulate the 'age' column?
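A couple of possible answers, sketched on a made-up `ages` Series (the values, bin edges, and labels are assumptions chosen for illustration). Plain arithmetic on a Series is vectorized, so simple maths does not need a lambda at all, and `pd.cut` turns a numeric column into labelled bands.

```python
import pandas as pd

# Hypothetical ages, standing in for df.age.
ages = pd.Series([52, 53, 70, 35], name="age")

# Vectorized arithmetic gives the same result as map(lambda x: x * 10).
times_ten = ages * 10

# pd.cut assigns each value to a bin; bins are right-inclusive by default,
# so the bins here are (0, 40], (40, 60] and (60, 120].
age_band = pd.cut(ages, bins=[0, 40, 60, 120], labels=["<=40", "41-60", "61+"])

print(times_ten.tolist())
print(age_band.astype(str).tolist())
```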

#### We can also use these functions on strings.  Let's add a column of patient names to our dataset and manipulate those using a lambda function!

In [19]:
!pip install faker



In [48]:
from faker import Faker
fake = Faker()
fake.name()

'Michael Andersen'

Let's generate a column of names

In [55]:
#creating one fake patient name per row
df['name'] = [fake.name() for _ in range(len(df))]
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target,name
0,52,1,0,125,212,0,1,168,0,1.0,2,2,3,0,Megan Bryant
1,53,1,0,140,203,1,0,155,1,3.1,0,0,3,0,Justin Hunt
2,70,1,0,145,174,0,1,125,1,2.6,0,0,3,0,John Morse
3,61,1,0,148,203,0,1,161,0,0.0,2,1,3,0,Isaac Goodman
4,62,0,0,138,294,1,1,106,0,1.9,1,3,2,0,Joshua Hall


In [60]:
#length of each value's string representation
df.applymap(lambda x: len(str(x)))

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target,name
0,2,1,1,3,3,1,1,3,1,3,1,1,1,1,12
1,2,1,1,3,3,1,1,3,1,3,1,1,1,1,11
2,2,1,1,3,3,1,1,3,1,3,1,1,1,1,10
3,2,1,1,3,3,1,1,3,1,3,1,1,1,1,13
4,2,1,1,3,3,1,1,3,1,3,1,1,1,1,11
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1020,2,1,1,3,3,1,1,3,1,3,1,1,1,1,14
1021,2,1,1,3,3,1,1,3,1,3,1,1,1,1,12
1022,2,1,1,3,3,1,1,3,1,3,1,1,1,1,10
1023,2,1,1,3,3,1,1,3,1,3,1,1,1,1,15


In [61]:
#replacing the spaces in each name with an _

df['name']=df.name.apply(lambda x: x.replace(' ','_'))
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target,name
0,52,1,0,125,212,0,1,168,0,1.0,2,2,3,0,Megan_Bryant
1,53,1,0,140,203,1,0,155,1,3.1,0,0,3,0,Justin_Hunt
2,70,1,0,145,174,0,1,125,1,2.6,0,0,3,0,John_Morse
3,61,1,0,148,203,0,1,161,0,0.0,2,1,3,0,Isaac_Goodman
4,62,0,0,138,294,1,1,106,0,1.9,1,3,2,0,Joshua_Hall
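For string columns, pandas also offers the `.str` accessor, which often replaces a lambda. The sketch below uses two names from the output above in a standalone Series (the variable names are assumptions) and compares it with the `apply` approach.

```python
import pandas as pd

# Standalone Series so the example runs without faker or heart.csv.
names = pd.Series(["Megan Bryant", "Justin Hunt"], name="name")

# The lambda version used above.
with_apply = names.apply(lambda x: x.replace(" ", "_"))

# The .str accessor version; regex=False treats " " as a literal string.
with_str = names.str.replace(" ", "_", regex=False)

# .str also chains: split on the space and take the first piece.
first_names = names.str.split(" ").str[0]

print(with_str.tolist())
print(first_names.tolist())
```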


# Transposing Data

There are a variety of ways you might want to reshape your dataframe.  One thing you might want to do is turn columns into rows.  You can do this very easily using the `.transpose()` method or its shorthand, the `.T` attribute.

[Pandas Transpose](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transpose.html)

In [158]:
df.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1015,1016,1017,1018,1019,1020,1021,1022,1023,1024
age,52,53,70,61,62,58,58,55,46,54,...,58,65,53,41,47,59,60,47,50,54
sex,1,1,1,1,0,0,1,1,1,1,...,1,1,1,1,1,1,1,1,0,1
cp,0,0,0,0,0,0,0,0,0,0,...,0,3,0,0,0,1,0,0,0,0
trestbps,125,140,145,148,138,100,114,160,120,122,...,128,138,123,110,112,140,125,110,110,120
chol,212,203,174,203,294,248,318,289,249,286,...,216,282,282,172,204,221,258,275,254,188
fbs,0,1,0,0,1,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
restecg,1,0,1,1,1,0,2,0,0,0,...,0,0,1,0,1,1,0,0,0,1
thalach,168,155,125,161,106,122,140,145,144,116,...,131,174,95,158,143,164,141,118,159,113
exang,0,1,1,0,0,0,0,1,0,1,...,1,0,1,0,0,1,1,1,0,0
oldpeak,1,3.1,2.6,0,1.9,1,4.4,0.8,0.8,3.2,...,2.2,1.4,2,0,0.1,0,2.8,1,0,1.4
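On a frame this wide the effect is hard to see, so here is a tiny sketch with made-up data (`small`, `mixed`, and the `p1`..`p3` labels are hypothetical). It also shows one side effect worth knowing about: transposing a frame that mixes numbers and strings leaves every resulting column with the generic `object` dtype.

```python
import pandas as pd

# Hypothetical 3x2 frame with labelled rows.
small = pd.DataFrame({"age": [52, 53, 70], "chol": [212, 203, 174]},
                     index=["p1", "p2", "p3"])

# Rows become columns and columns become rows.
flipped = small.T

# Adding a string column means each transposed column holds mixed types.
mixed = small.assign(name=["Megan", "Justin", "John"])
mixed_flipped = mixed.T

print(flipped)
print(mixed_flipped.dtypes)
```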


### Pivot tables, which you might already be familiar with, can also be created in pandas using the `.pivot_table()` method

[pivot_table](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html)

In [68]:
#mean chol for each combination of sex and cp
df.pivot_table(index = 'sex' , columns='cp' , values='chol', aggfunc='mean')

In [67]:
#let's add some margins (the 'All' row and column); aggfunc can also be a custom function, here twice the median
df.pivot_table(index = 'sex', columns= 'cp', values="chol", aggfunc=lambda x: 2*(x.median()),margins=True)

sex \ cp,0,1,2,3,All
0,516,500,512,480,508
1,486,470,462,466,468
All,494,472,466,468,480
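To make the mechanics easier to check by hand, here is a small sketch on five made-up rows (the values are assumptions, only the column names come from the heart data). It builds a mean pivot table with margins and shows that the body of the table matches a `groupby` followed by `unstack`.

```python
import pandas as pd

# Hypothetical five-row sample with the same column names as the heart data.
df = pd.DataFrame({
    "sex":  [0, 0, 1, 1, 1],
    "cp":   [0, 1, 0, 0, 1],
    "chol": [200, 300, 100, 200, 400],
})

# One row per sex, one column per cp, each cell the mean chol.
# margins=True adds an 'All' row and column computed with the same aggfunc.
table = df.pivot_table(index="sex", columns="cp", values="chol",
                       aggfunc="mean", margins=True)

# The same cell values, built by grouping and then unstacking cp into columns.
grouped = df.groupby(["sex", "cp"])["chol"].mean().unstack()

print(table)
print(grouped)
```

Note that the `All` cells are the aggregate of the underlying rows (for example, the mean of all five `chol` values), not the average of the cell means.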


### Another way to view information across two or more factors is to create a cross tabulation using the pandas `pd.crosstab()` function

[Pandas Crosstab](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html)

In [62]:
pd.crosstab(index=[df.sex, df.cp], columns=df.exang, margins=True)

sex,cp,exang=0,exang=1,All
0,0.0,71,62,133
0,1.0,51,6,57
0,2.0,103,6,109
0,3.0,13,0,13
1,0.0,143,221,364
1,1.0,104,6,110
1,2.0,144,31,175
1,3.0,51,13,64
All,,680,345,1025

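By default `pd.crosstab` counts rows. The sketch below, on five made-up rows (values are assumptions), shows the counts with margins and then the `normalize="index"` option, which converts each row into shares that sum to 1.

```python
import pandas as pd

# Hypothetical five-row sample with two of the heart data's column names.
df = pd.DataFrame({
    "sex":   [0, 0, 1, 1, 1],
    "exang": [0, 1, 0, 0, 1],
})

# Raw counts for each sex/exang combination, plus 'All' totals.
counts = pd.crosstab(index=df["sex"], columns=df["exang"], margins=True)

# Row-wise proportions: each sex row sums to 1.
shares = pd.crosstab(index=df["sex"], columns=df["exang"], normalize="index")

print(counts)
print(shares)
```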

In [26]:
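#a small test dataframe, kept under its own name so the heart data in df is not overwritten
df_test = pd.DataFrame([[1, 2.12], [3.356, 4.567]])

In [27]:
df_test.applymap(lambda x: x+3)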

Unnamed: 0,0,1
0,4.0,5.12
1,6.356,7.567


In [78]:
df.iloc[1:4]

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target,name
1,53,1,0,140,203,1,0,155,1,3.1,0,0,3,0,Justin_Hunt
2,70,1,0,145,174,0,1,125,1,2.6,0,0,3,0,John_Morse
3,61,1,0,148,203,0,1,161,0,0.0,2,1,3,0,Isaac_Goodman


In [73]:
df.loc[:,['age','sex']]

Unnamed: 0,age,sex
0,52,1
1,53,1
2,70,1
3,61,1
4,62,0
...,...,...
1020,59,1
1021,60,1
1022,47,1
1023,50,0


In [84]:
#a small dataframe of random floats to demonstrate .astype(); note that this replaces the heart data in df
df=pd.DataFrame(np.random.randn(3,3),index=['first','second','third'],columns=['pos','a','b'])

df

Unnamed: 0,pos,a,b
first,0.312536,-2.029674,-0.736669
second,0.910412,-0.03487,0.152453
third,-1.336721,-0.219357,-1.368103


In [88]:
list(range(1,4))

[1, 2, 3]

In [89]:
#casting column 'a' to int drops the decimal part (it truncates toward zero rather than rounding)
df.astype({'a':'int'})

Unnamed: 0,pos,a,b
first,0.312536,-2,-0.736669
second,0.910412,0,0.152453
third,-1.336721,0,-1.368103
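As the output shows, casting floats to `int` truncates toward zero instead of rounding. If rounding is what you want, one option is to call `.round()` first. This is a minimal sketch with fixed values (two taken from the random output above, one from its `pos` column) so the result is reproducible.

```python
import pandas as pd

# Fixed values instead of np.random.randn so the result is repeatable.
floats = pd.DataFrame({"a": [-2.029674, -0.03487, 0.910412]})

# astype("int") simply drops the fractional part.
truncated = floats.astype({"a": "int"})

# Rounding first gives the nearest whole number before the cast.
rounded = floats.round().astype({"a": "int"})

print(truncated["a"].tolist())
print(rounded["a"].tolist())
```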
