# Useful Methods
* apply() method
* apply() with a function
* apply() with a lambda expression
* apply() on multiple columns
* describe()
* sort_values()
* corr()
* idxmin and idxmax
* value_counts
* replace
* unique and nunique
* map
* duplicated and drop_duplicates
* between
* sample
* nlargest


## The .apply() method
The .apply() method lets us run a custom function on every value of a DataFrame column.

In [1]:
import pandas as pd
import numpy as np

In [2]:
from google.colab import files
uploaded = files.upload()

Saving tips.csv to tips.csv


In [3]:
df = pd.read_csv('tips.csv')

In [4]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251



## apply with a function

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   total_bill        244 non-null    float64
 1   tip               244 non-null    float64
 2   sex               244 non-null    object 
 3   smoker            244 non-null    object 
 4   day               244 non-null    object 
 5   time              244 non-null    object 
 6   size              244 non-null    int64  
 7   price_per_person  244 non-null    float64
 8   Payer Name        244 non-null    object 
 9   CC Number         244 non-null    int64  
 10  Payment ID        244 non-null    object 
dtypes: float64(3), int64(2), object(6)
memory usage: 21.1+ KB


In [6]:
# No single pandas method both converts a value to a string
# and then takes this exact kind of slice.
# This is where the apply method comes in: it lets you write a custom function
# and then apply it to every value of a pandas Series.
str(123456456)[-4:]

'6456'

In [7]:
# let's wrap this logic in a function
def last_four(num):
  return str(num)[-4:]

last_four(123456456) # check that it works

'6456'

In [8]:
# now apply last_four to every row of the column, whose values are all integers
# (df.info() reports: CC Number         244 non-null    int64)
df["CC Number"].apply(last_four)

Unnamed: 0,CC Number
0,3410
1,9230
2,1322
3,5994
4,7221
...,...
239,2842
240,5404
241,7196
242,0950


In [9]:
# the result is a Series, so it can be assigned directly to a new column named last_four
df["last_four"] = df["CC Number"].apply(last_four)
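For comparison, the same last-four extraction can also be written by chaining pandas' string accessor instead of calling apply. The sketch below uses a small hypothetical frame (`demo`) in place of tips.csv and only checks that both approaches agree.

```python
import pandas as pd

# Hypothetical stand-in for the "CC Number" column of tips.csv.
demo = pd.DataFrame({"CC Number": [3560325168603410, 4478071379779230, 4375220550950]})

def last_four(num):
    # Convert to a string, then keep the last four characters.
    return str(num)[-4:]

# Approach 1: custom function through apply.
via_apply = demo["CC Number"].apply(last_four)

# Approach 2: convert the whole column to strings, then slice with .str.
via_str = demo["CC Number"].astype(str).str[-4:]

print(via_apply.tolist())
print(via_str.tolist())
```

Either form returns strings, so leading zeros such as "0950" are preserved.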

## Using .apply() with more complex functions

In [10]:
# Suppose we work for a review site such as Yelp or Google Reviews. Sites like
# these usually show how pricey a restaurant is with one, two, or three dollar signs.
# Here we only have the total bill price.
# So let's create a custom function that takes in a single price and returns
# a single dollar sign, two dollar signs, or three dollar signs based on that price.

In [11]:
df["total_bill"].mean()

np.float64(19.78594262295082)

In [12]:
def yelp(price):
  if price < 10:
    return "$"
  elif price >= 10 and price < 30:
    return "$$"
  else:
    return "$$$"

In [13]:
df["yelp"] = df["total_bill"].apply(yelp)

df["yelp"]

# Functions passed to apply should return a single value. Each call receives
# a single value, the cell value for one row of the total_bill column,
# so it should return a single value too, such as one string or one integer.
#
# It should not return a collection of several entries,
# because the result holds exactly one value per row.

In [14]:
df

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,last_four,yelp
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959,3410,$$
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608,9230,$$
2,21.01,3.50,Male,No,Sun,Dinner,3,7.00,Travis Walters,6011812112971322,Sun4458,1322,$$
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260,5994,$$
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251,7221,$$
...,...,...,...,...,...,...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3,9.68,Michael Avila,5296068606052842,Sat2657,2842,$$
240,27.18,2.00,Female,Yes,Sat,Dinner,2,13.59,Monica Sanders,3506806155565404,Sat1766,5404,$$
241,22.67,2.00,Male,Yes,Sat,Dinner,2,11.34,Keith Wong,6011891618747196,Sat3880,7196,$$
242,17.82,1.75,Male,No,Sat,Dinner,2,8.91,Dennis Dixon,4375220550950,Sat17,0950,$$
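As a point of comparison for the price-bucket function above, pandas also offers `pd.cut` for binning numeric values. This is a sketch under assumptions: the `prices` Series is made up for illustration, and the bin edges simply mirror the thresholds of 10 and 30 used in `yelp`.

```python
import numpy as np
import pandas as pd

# Hypothetical bill amounts, including values right on the thresholds.
prices = pd.Series([3.07, 9.99, 10.0, 16.99, 29.99, 30.0, 50.81])

def yelp(price):
    if price < 10:
        return "$"
    elif price < 30:
        return "$$"
    return "$$$"

via_apply = prices.apply(yelp)

# right=False closes each bin on the left: [-inf, 10), [10, 30), [30, inf).
via_cut = pd.cut(
    prices,
    bins=[-np.inf, 10, 30, np.inf],
    right=False,
    labels=["$", "$$", "$$$"],
)

print(via_apply.tolist())
print(list(via_cut))
```

Note that `pd.cut` returns a categorical Series rather than plain strings, which may or may not be what a later step expects.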


## apply with lambda

In [15]:
# What if a function depends on two columns?
# Custom functions used with apply can operate on multiple inputs:
# not just two columns, but as many columns as you want.

In [16]:
def simple(num):
  return num*2

simple(2)

4

In [17]:
# A lambda takes in a value and returns the result of an expression on it.
# Not every function can be converted to a lambda expression.
# We will use lambda expressions to pass short functions to the apply method.
lambda num: num*2

<function __main__.<lambda>(num)>

In [18]:
# how to apply a single lambda expression to a column
# equivalent to df["total_bill"].apply(simple)
df["total_bill"].apply(lambda num: num*2)

Unnamed: 0,total_bill
0,33.98
1,20.68
2,42.02
3,47.36
4,49.18
...,...
239,58.06
240,54.36
241,45.34
242,35.64
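For simple arithmetic like doubling, the lambda is not strictly needed: arithmetic operators on a Series already work element by element. A minimal sketch, using a made-up `bills` Series rather than the real column:

```python
import pandas as pd

# Hypothetical values standing in for df["total_bill"].
bills = pd.Series([16.99, 10.34, 21.01], name="total_bill")

# Element-wise via apply and a lambda.
doubled_apply = bills.apply(lambda num: num * 2)

# Element-wise via a plain arithmetic operator on the Series.
doubled_vectorized = bills * 2

print(doubled_apply.equals(doubled_vectorized))
```

The apply form remains useful when the logic cannot be expressed with built-in Series operations.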


## apply that uses multiple columns

Note, there are several ways to do this:

https://stackoverflow.com/questions/19914937/applying-function-with-multiple-arguments-to-create-a-new-pandas-column

In [19]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,last_four,yelp
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959,3410,$$
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608,9230,$$
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458,1322,$$
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260,5994,$$
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251,7221,$$


In [20]:
def quality(total_bill, tip):
  if tip/total_bill > 0.25:
    return "Generous"
  else:
    return "Other"

quality(16.99, 1.01)

# To apply this function, it has to receive both total_bill and tip as inputs,
# so the call is wrapped in a lambda expression.

'Other'

In [21]:
# axis=1 passes one row at a time; the parameter is named row so it does not shadow df
df[["total_bill", "tip"]].apply(lambda row: quality(row["total_bill"], row["tip"]), axis=1)

Unnamed: 0,0
0,Other
1,Other
2,Other
3,Other
4,Other
...,...
239,Other
240,Other
241,Other
242,Other


In [22]:
# assign the result to an actual column
df["Quality"] = df[["total_bill", "tip"]].apply(lambda row: quality(row["total_bill"], row["tip"]), axis=1)
df

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,last_four,yelp,Quality
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959,3410,$$,Other
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608,9230,$$,Other
2,21.01,3.50,Male,No,Sun,Dinner,3,7.00,Travis Walters,6011812112971322,Sun4458,1322,$$,Other
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260,5994,$$,Other
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251,7221,$$,Other
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3,9.68,Michael Avila,5296068606052842,Sat2657,2842,$$,Other
240,27.18,2.00,Female,Yes,Sat,Dinner,2,13.59,Monica Sanders,3506806155565404,Sat1766,5404,$$,Other
241,22.67,2.00,Male,Yes,Sat,Dinner,2,11.34,Keith Wong,6011891618747196,Sat3880,7196,$$,Other
242,17.82,1.75,Male,No,Sat,Dinner,2,8.91,Dennis Dixon,4375220550950,Sat17,0950,$$,Other


In [23]:
# a way to make this run a lot faster: wrap the function with np.vectorize
df["Quality"] = np.vectorize(quality)(df["total_bill"],df["tip"])
df

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,last_four,yelp,Quality
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959,3410,$$,Other
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608,9230,$$,Other
2,21.01,3.50,Male,No,Sun,Dinner,3,7.00,Travis Walters,6011812112971322,Sun4458,1322,$$,Other
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260,5994,$$,Other
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251,7221,$$,Other
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3,9.68,Michael Avila,5296068606052842,Sat2657,2842,$$,Other
240,27.18,2.00,Female,Yes,Sat,Dinner,2,13.59,Monica Sanders,3506806155565404,Sat1766,5404,$$,Other
241,22.67,2.00,Male,Yes,Sat,Dinner,2,11.34,Keith Wong,6011891618747196,Sat3880,7196,$$,Other
242,17.82,1.75,Male,No,Sat,Dinner,2,8.91,Dennis Dixon,4375220550950,Sat17,0950,$$,Other


In [24]:
# let's compare the two approaches using the timeit module
import timeit
# timeit takes two pieces of code:
# setup code, which runs once, and statement code, which is what gets timed.

In [25]:
# code snippet to be executed only once
setup = '''
import numpy as np
import pandas as pd
df = pd.read_csv('tips.csv')
def quality(total_bill,tip):
    if tip/total_bill  > 0.25:
        return "Generous"
    else:
        return "Other"
'''
# timeit takes the setup code as a multi-line string

In [26]:
# code snippet whose execution time is to be measured
stmt_one = '''
df['Tip Quality'] = df[['total_bill','tip']].apply(lambda df: quality(df['total_bill'],df['tip']),axis=1)
'''

stmt_two = '''
df['Tip Quality'] = np.vectorize(quality)(df['total_bill'], df['tip'])
'''
# these are the statements to run after the setup code;
# timeit runs each one for the requested number of loops and reports the total time in seconds.

In [27]:
timeit.timeit(setup=setup, stmt=stmt_one, number=1000)

3.8250229580000052

In [28]:
timeit.timeit(setup=setup, stmt=stmt_two, number=1000) # much faster: roughly 9x in this run

0.4193255499999964
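Because the quality rule is a single condition, it can also be expressed with `np.where`, which works on whole columns without calling a Python function per row. The sketch below is an illustration on a hypothetical three-row frame; it has not been timed here, so no speed figure is claimed.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for the total_bill and tip columns.
demo = pd.DataFrame({"total_bill": [16.99, 3.07, 7.25], "tip": [1.01, 1.00, 5.15]})

def quality(total_bill, tip):
    return "Generous" if tip / total_bill > 0.25 else "Other"

# Approach from the notebook: wrap the Python function with np.vectorize.
via_vectorize = np.vectorize(quality)(demo["total_bill"], demo["tip"])

# Alternative: compute the condition on whole columns, then pick a label.
via_where = np.where(demo["tip"] / demo["total_bill"] > 0.25, "Generous", "Other")

print(list(via_vectorize))
print(list(via_where))
```

Both calls return NumPy arrays, so either can be assigned to a column with `demo["Quality"] = ...`.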

## df.describe for statistical summaries

In [29]:
df.describe()

Unnamed: 0,total_bill,tip,size,price_per_person,CC Number
count,244.0,244.0,244.0,244.0,244.0
mean,19.785943,2.998279,2.569672,7.888197,2563496000000000.0
std,8.902412,1.383638,0.9511,2.914234,2369340000000000.0
min,3.07,1.0,1.0,2.88,60406790000.0
25%,13.3475,2.0,2.0,5.8,30407310000000.0
50%,17.795,2.9,2.0,7.255,3525318000000000.0
75%,24.1275,3.5625,3.0,9.39,4553675000000000.0
max,50.81,10.0,6.0,20.27,6596454000000000.0


In [32]:
df.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
total_bill,244.0,19.78594,8.902412,3.07,13.3475,17.795,24.1275,50.81
tip,244.0,2.998279,1.383638,1.0,2.0,2.9,3.5625,10.0
size,244.0,2.569672,0.9510998,1.0,2.0,2.0,3.0,6.0
price_per_person,244.0,7.888197,2.914234,2.88,5.8,7.255,9.39,20.27
CC Number,244.0,2563496000000000.0,2369340000000000.0,60406790000.0,30407310000000.0,3525318000000000.0,4553675000000000.0,6596454000000000.0
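By default describe() summarizes only the numeric columns. The sketch below, on a small hypothetical frame, shows two optional parameters: `include="all"` to add text columns to the summary, and `percentiles` to choose which percentiles are reported.

```python
import pandas as pd

# Hypothetical mini version of the tips data.
demo = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68],
    "day": ["Sun", "Sun", "Sat", "Sun"],
})

# include="all" adds unique/top/freq rows for the non-numeric columns.
full_summary = demo.describe(include="all")

# percentiles replaces the default 25%/75% rows (the 50% row is always kept).
custom_summary = demo.describe(percentiles=[0.1, 0.9])

print(full_summary)
print(custom_summary)
```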


## sort_values()

In [36]:
df.sort_values("tip",ascending=True) # sort by the tip column
# ascending=True is the default: values go from lowest to highest

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,last_four,yelp,Quality
92,5.75,1.00,Female,Yes,Fri,Dinner,2,2.88,Leah Ramirez,3508911676966392,Fri3780,6392,$,Other
111,7.25,1.00,Female,No,Sat,Dinner,1,7.25,Terri Jones,3559221007826887,Sat4801,6887,$,Other
67,3.07,1.00,Female,Yes,Sat,Dinner,1,3.07,Tiffany Brock,4359488526995267,Sat3455,5267,$,Generous
236,12.60,1.00,Male,Yes,Sat,Dinner,2,6.30,Matthew Myers,3543676378973965,Sat5032,3965,$$,Other
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959,3410,$$,Other
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
141,34.30,6.70,Male,No,Thur,Lunch,6,5.72,Steven Carlson,3526515703718508,Thur1025,8508,$$$,Other
59,48.27,6.73,Male,No,Sat,Dinner,4,12.07,Brian Ortiz,6596453823950595,Sat8139,0595,$$$,Other
23,39.42,7.58,Male,No,Sat,Dinner,4,9.86,Lance Peterson,3542584061609808,Sat239,9808,$$$,Other
212,48.33,9.00,Male,No,Sat,Dinner,4,12.08,Alex Williamson,676218815212,Sat4590,5212,$$$,Other


In [37]:
df.sort_values(["tip","size"]) #sort by multiple columns

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,last_four,yelp,Quality
67,3.07,1.00,Female,Yes,Sat,Dinner,1,3.07,Tiffany Brock,4359488526995267,Sat3455,5267,$,Generous
111,7.25,1.00,Female,No,Sat,Dinner,1,7.25,Terri Jones,3559221007826887,Sat4801,6887,$,Other
92,5.75,1.00,Female,Yes,Fri,Dinner,2,2.88,Leah Ramirez,3508911676966392,Fri3780,6392,$,Other
236,12.60,1.00,Male,Yes,Sat,Dinner,2,6.30,Matthew Myers,3543676378973965,Sat5032,3965,$$,Other
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959,3410,$$,Other
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
141,34.30,6.70,Male,No,Thur,Lunch,6,5.72,Steven Carlson,3526515703718508,Thur1025,8508,$$$,Other
59,48.27,6.73,Male,No,Sat,Dinner,4,12.07,Brian Ortiz,6596453823950595,Sat8139,0595,$$$,Other
23,39.42,7.58,Male,No,Sat,Dinner,4,9.86,Lance Peterson,3542584061609808,Sat239,9808,$$$,Other
212,48.33,9.00,Male,No,Sat,Dinner,4,12.08,Alex Williamson,676218815212,Sat4590,5212,$$$,Other
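When sorting by several columns, `ascending` can also be a list with one direction per column. A minimal sketch on hypothetical data (the index labels are arbitrary):

```python
import pandas as pd

# Hypothetical rows with tied tip values.
demo = pd.DataFrame(
    {"tip": [1.0, 1.0, 9.0, 9.0], "size": [2, 1, 4, 3]},
    index=[92, 67, 212, 170],
)

# Largest tips first; ties on tip are broken by the smallest party size.
ordered = demo.sort_values(["tip", "size"], ascending=[False, True])

print(ordered)
```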


## idxmin and idxmax

In [48]:
df["total_bill"].max()

50.81

In [39]:
# index label of the row where this max value occurs
df["total_bill"].idxmax()

170

In [42]:
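df.loc[170] # idxmax returns an index label, so .loc is the matching lookup (same row as .iloc here, given the default RangeIndex)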

Unnamed: 0,170
total_bill,50.81
tip,10.0
sex,Male
smoker,Yes
day,Sat
time,Dinner
size,3
price_per_person,16.94
Payer Name,Gregory Clark
CC Number,5473850968388236
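idxmax and idxmin return index labels, not integer positions. With the default RangeIndex the two coincide, but they differ once the index is something else. The following sketch uses a hypothetical frame with string labels to show the distinction.

```python
import pandas as pd

# Hypothetical frame with a non-default index.
bills = pd.DataFrame({"total_bill": [16.99, 50.81, 10.34]}, index=["a", "b", "c"])

label_of_max = bills["total_bill"].idxmax()   # index label
label_of_min = bills["total_bill"].idxmin()   # index label
position_of_max = int(bills["total_bill"].to_numpy().argmax())  # integer position

# .loc looks up by label, .iloc by integer position.
row_by_label = bills.loc[label_of_max]
row_by_position = bills.iloc[position_of_max]

print(label_of_max, label_of_min, position_of_max)
```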


## df.corr() for correlation checks

In [45]:
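# pairwise correlation of columns, excluding NA/null values:
# a measure of how strongly the values of two columns move together.
# This only works with numeric columns, so select them first;
# in recent pandas versions a plain df.corr() raises an error on text columns.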
df.select_dtypes(include=np.number).corr()

Unnamed: 0,total_bill,tip,size,price_per_person,CC Number
total_bill,1.0,0.675734,0.598315,0.647554,0.104576
tip,0.675734,1.0,0.489299,0.347405,0.110857
size,0.598315,0.489299,1.0,-0.175359,-0.030239
price_per_person,0.647554,0.347405,-0.175359,1.0,0.13524
CC Number,0.104576,0.110857,-0.030239,0.13524,1.0
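An alternative to selecting numeric columns by hand is the `numeric_only` parameter of corr(). This sketch assumes pandas 1.5 or later, where that parameter is available, and uses a hypothetical frame with one text column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mixing numeric and text columns.
demo = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip": [1.0, 2.5, 2.9, 4.2],
    "day": ["Sun", "Sat", "Sun", "Fri"],
})

# Approach from the notebook: keep only numeric columns, then correlate.
corr_selected = demo.select_dtypes(include=np.number).corr()

# Alternative: let corr skip the non-numeric columns itself.
corr_numeric_only = demo.corr(numeric_only=True)

print(corr_numeric_only)
```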


In [46]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,last_four,yelp,Quality
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959,3410,$$,Other
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608,9230,$$,Other
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458,1322,$$,Other
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260,5994,$$,Other
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251,7221,$$,Other


## value_counts
A quick way to count the rows in each category. It is most useful on categorical columns.

In [49]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,last_four,yelp,Quality
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959,3410,$$,Other
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608,9230,$$,Other
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458,1322,$$,Other
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260,5994,$$,Other
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251,7221,$$,Other


In [47]:
df["sex"].value_counts()

sex,count
Male,157
Female,87
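value_counts can also report proportions instead of raw counts through `normalize=True`. A minimal sketch on a hypothetical four-row Series:

```python
import pandas as pd

# Hypothetical stand-in for df["sex"].
sex = pd.Series(["Male", "Female", "Male", "Male"], name="sex")

counts = sex.value_counts()                 # raw counts per category
shares = sex.value_counts(normalize=True)   # fraction of rows per category

print(counts)
print(shares)
```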


## unique

In [50]:
df["day"].unique()

array(['Sun', 'Sat', 'Thur', 'Fri'], dtype=object)

In [51]:
#number of unique values
df["day"].nunique()
# same as len(df["day"].unique()) when the column has no missing values

4

In [57]:
df['time'].unique()

array(['Dinner', 'Lunch'], dtype=object)

In [52]:
df["day"].value_counts()

day,count
Sat,87
Sun,76
Thur,62
Fri,19


## replace

Quickly replace values with other values.

In [53]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,last_four,yelp,Quality
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959,3410,$$,Other
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608,9230,$$,Other
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458,1322,$$,Other
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260,5994,$$,Other
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251,7221,$$,Other


In [56]:
# replace Female and Male with just the letters F and M
df["sex"].replace(["Female","Male"],["F","M"])
# passing two lists is convenient when there are only a few values to replace

Unnamed: 0,sex
0,F
1,M
2,M
3,M
4,F
...,...
239,M
240,F
241,M
242,M


## map

In [58]:
mymap = {"Female":"F","Male":"M"}

In [59]:
# a dictionary with map is convenient when there are many values to replace
df["sex"].map(mymap)

Unnamed: 0,sex
0,F
1,M
2,M
3,M
4,F
...,...
239,M
240,F
241,M
242,M
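One practical difference between replace and map is how they treat values that have no entry in the mapping: replace leaves them unchanged, while map turns them into NaN. The sketch below shows this on a hypothetical Series containing an extra "Unknown" value that is not in the tips data.

```python
import pandas as pd

# Hypothetical values; "Unknown" has no entry in the mapping.
sex = pd.Series(["Female", "Male", "Unknown"])
mymap = {"Female": "F", "Male": "M"}

replaced = sex.replace(mymap)  # unmatched values are kept as they are
mapped = sex.map(mymap)        # unmatched values become NaN

print(replaced.tolist())
print(mapped.tolist())
```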


## Duplicates

### .duplicated() and .drop_duplicates()

In [60]:
df.duplicated()

Unnamed: 0,0
0,False
1,False
2,False
3,False
4,False
...,...
239,False
240,False
241,False
242,False


In [62]:
simple_df = pd.DataFrame([1,2,2],["a","b","c"])
simple_df

Unnamed: 0,0
a,1
b,2
c,2


In [63]:
simple_df.duplicated()

Unnamed: 0,0
a,False
b,False
c,True


In [64]:
simple_df.drop_duplicates()

Unnamed: 0,0
a,1
b,2
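Both duplicated() and drop_duplicates() accept `subset` to compare only some columns and `keep` to choose which occurrence counts as the original. A sketch on a hypothetical frame:

```python
import pandas as pd

# Hypothetical frame: "Ann" appears twice, but on different days.
demo = pd.DataFrame({"name": ["Ann", "Bob", "Ann"], "day": ["Sat", "Sat", "Sun"]})

whole_row = demo.duplicated()                         # compares all columns
by_name = demo.duplicated(subset="name")              # compares only "name"
all_repeats = demo.duplicated(subset="name", keep=False)  # flags every repeated row

# Keep the last occurrence of each name instead of the first.
deduped = demo.drop_duplicates(subset="name", keep="last")

print(by_name.tolist())
print(deduped)
```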


## between

* left: A scalar value that defines the left boundary.
* right: A scalar value that defines the right boundary.
* inclusive: Which boundaries to include: 'both' (the default), 'neither', 'left', or 'right'. Older pandas versions took a Boolean here instead.

In [67]:
df["total_bill"].between(10,20,inclusive='both')

Unnamed: 0,total_bill
0,True
1,True
2,False
3,False
4,False
...,...
239,False
240,False
241,False
242,True
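To make the effect of `inclusive` concrete, the sketch below applies all four options to a hypothetical Series whose values sit exactly on the boundaries. It assumes a pandas version that accepts the string options (1.3 or later).

```python
import pandas as pd

# Hypothetical values on and between the boundaries 10 and 20.
bills = pd.Series([10.0, 15.0, 20.0])

both = bills.between(10, 20, inclusive="both")        # 10 <= x <= 20
neither = bills.between(10, 20, inclusive="neither")  # 10 <  x <  20
left = bills.between(10, 20, inclusive="left")        # 10 <= x <  20
right = bills.between(10, 20, inclusive="right")      # 10 <  x <= 20

print(both.tolist(), neither.tolist(), left.tolist(), right.tolist())
```

The resulting Boolean Series can be used directly as a filter, for example `bills[left]`.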


## nlargest and nsmallest

In [69]:
# the 2 rows with the largest tips
df.nlargest(2, "tip")
# similar to df.sort_values("tip",ascending=False).iloc[0:2]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,last_four,yelp,Quality
170,50.81,10.0,Male,Yes,Sat,Dinner,3,16.94,Gregory Clark,5473850968388236,Sat1954,8236,$$$,Other
212,48.33,9.0,Male,No,Sat,Dinner,4,12.08,Alex Williamson,676218815212,Sat4590,5212,$$$,Other
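nsmallest works the same way in the other direction. A minimal sketch on a hypothetical tip column:

```python
import pandas as pd

# Hypothetical tip values.
demo = pd.DataFrame({"tip": [1.01, 10.0, 1.0, 9.0]})

two_smallest = demo.nsmallest(2, "tip")  # rows with the lowest tips, lowest first
two_largest = demo.nlargest(2, "tip")    # rows with the highest tips, highest first

# The sort-based form gives the same rows for this data.
via_sort = demo.sort_values("tip").iloc[:2]

print(two_smallest)
print(two_largest)
```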


## sample

In [70]:
# 5 random rows
df.sample(5)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,last_four,yelp,Quality
14,14.83,3.02,Female,No,Sun,Dinner,2,7.42,Vanessa Jones,30016702287574,Sun3848,7574,$$,Other
18,16.97,3.5,Female,No,Sun,Dinner,3,5.66,Laura Martinez,30422275171379,Sun2789,1379,$$,Other
216,28.15,3.0,Male,Yes,Sat,Dinner,5,5.63,Shawn Barnett PhD,4590982568244,Sat7320,8244,$$,Other
51,10.29,2.6,Female,No,Sun,Dinner,2,5.14,Jessica Ibarra,4999759463713,Sun4474,3713,$$,Generous
186,20.9,3.5,Female,Yes,Sun,Dinner,3,6.97,Heidi Atkinson,4422858423131187,Sun4254,1187,$$,Other


In [71]:
# randomly sample 10% of the rows in the table
df.sample(frac=0.1)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,last_four,yelp,Quality
35,24.06,3.6,Male,No,Sat,Dinner,3,8.02,Joseph Mullins,5519770449260299,Sat632,299,$$,Other
230,24.01,2.0,Male,Yes,Sat,Dinner,4,6.0,Michael Osborne,4258682154026,Sat7872,4026,$$,Other
215,12.9,1.1,Female,Yes,Sat,Dinner,2,6.45,Jessica Owen,4726904879471,Sat6983,9471,$$,Other
172,7.25,5.15,Male,Yes,Sun,Dinner,2,3.62,Larry White,30432617123103,Sun9209,3103,$,Generous
83,32.68,5.0,Male,Yes,Thur,Lunch,2,16.34,Daniel Murphy,5356177501009133,Thur8801,9133,$$$,Other
188,18.15,3.5,Female,Yes,Sun,Dinner,3,6.05,Glenda Wiggins,578329325307,Sun430,5307,$$,Other
101,15.38,3.0,Female,Yes,Fri,Dinner,2,7.69,Tiffany Colon,6011012799432041,Fri8382,2041,$$,Other
204,20.53,4.0,Male,Yes,Thur,Lunch,4,5.13,Scott Kim,3570611756827620,Thur2160,7620,$$,Other
130,19.08,1.5,Male,No,Thur,Lunch,2,9.54,Seth Sexton,213113680829581,Thur1446,9581,$$,Other
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458,1322,$$,Other
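Since sample() draws different rows on every call, passing `random_state` is a common way to make a draw repeatable. The sketch below uses a hypothetical 100-row frame; the seed values are arbitrary.

```python
import pandas as pd

# Hypothetical frame with 100 rows.
demo = pd.DataFrame({"value": range(100)})

# The same seed gives the same rows on every call.
first = demo.sample(5, random_state=42)
second = demo.sample(5, random_state=42)

# frac also accepts a seed; 10% of 100 rows is 10 rows.
tenth = demo.sample(frac=0.1, random_state=0)

print(first.equals(second))
print(len(tenth))
```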
