## Useful Methods
Let's cover some useful methods and functions built into pandas. This is only a small sample of what pandas offers, but these are some of the most commonly used. The documentation (https://pandas.pydata.org/pandas-docs/stable/reference/index.html) is a great resource for exploring more methods and functions (we will introduce more later in the course). Here is a list of the functions and methods we'll cover:

apply() method

apply() with a function

apply() with a lambda expression

apply() on multiple columns

describe()

sort_values()

corr()

idxmin and idxmax

value_counts

replace

unique and nunique

map

duplicated and drop_duplicates

between

sample

nlargest

Make sure to view the video lessons to get the full explanation!

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv("tips.csv")

In [3]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251



## The .apply() method

Here we will learn about a very useful DataFrame method known as **apply**. It allows us to apply and broadcast custom functions on a DataFrame column.

### apply with a function

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   total_bill        244 non-null    float64
 1   tip               244 non-null    float64
 2   sex               244 non-null    object 
 3   smoker            244 non-null    object 
 4   day               244 non-null    object 
 5   time              244 non-null    object 
 6   size              244 non-null    int64  
 7   price_per_person  244 non-null    float64
 8   Payer Name        244 non-null    object 
 9   CC Number         244 non-null    int64  
 10  Payment ID        244 non-null    object 
dtypes: float64(3), int64(2), object(6)
memory usage: 21.1+ KB


In [5]:
# To get the last four digits of the CC (credit card) number:

# A value such as 2313456413213 is an int, so we cannot slice it directly.

# We first have to convert it to a str.

# str(2313456413213)[-4:]

# To do this on a pandas DataFrame, we first write a function and

# then use apply() to run that function on the part of the DataFrame we want.

In [6]:
def last_four(num):
    return str(num)[-4:]

In [7]:
df["CC Number"][0]

3560325168603410

In [8]:
last_four(df["CC Number"][0])

'3410'

In [9]:
df["CC Number"].apply(last_four)

0      3410
1      9230
2      1322
3      5994
4      7221
       ... 
239    2842
240    5404
241    7196
242    0950
243    8139
Name: CC Number, Length: 244, dtype: object

In [10]:
df["last_four"] = df["CC Number"].apply(last_four)

In [11]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,last_four
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959,3410
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608,9230
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458,1322
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260,5994
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251,7221
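As a possible alternative to apply(), the same slicing can be sketched with the `.str` accessor, which works on a whole column at once. The Series below is a small hypothetical stand-in for the "CC Number" column, not the real tips.csv data.

```python
import pandas as pd

# Hypothetical stand-in for the "CC Number" column of tips.csv
cards = pd.Series([3560325168603410, 4478071379779230, 4375220550950], name="CC Number")

# Convert every value to str, then slice the last four characters of each string
last_four_vec = cards.astype(str).str[-4:]

# The apply-based version from the notebook, for comparison
last_four_apply = cards.apply(lambda num: str(num)[-4:])

print(last_four_vec.tolist())
```

Both versions return strings, so a leading zero such as in '0950' is preserved.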


In [12]:
df["total_bill"]  # we want to convert the total_bill values to $, $$, $$$ symbols according to a criterion

0      16.99
1      10.34
2      21.01
3      23.68
4      24.59
       ...  
239    29.03
240    27.18
241    22.67
242    17.82
243    18.78
Name: total_bill, Length: 244, dtype: float64

### Using .apply() with more complex functions

In [13]:
def yelp(price):
    if price < 10 :
        return "$"
    elif price >= 10 and price <30 :
        return "$$"
    else:
        return "$$$"

In [14]:
df["total_bill"].apply(yelp)

0      $$
1      $$
2      $$
3      $$
4      $$
       ..
239    $$
240    $$
241    $$
242    $$
243    $$
Name: total_bill, Length: 244, dtype: object

In [15]:
df["expensive"] = df["total_bill"].apply(yelp)

df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,last_four,expensive
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959,3410,$$
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608,9230,$$
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458,1322,$$
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260,5994,$$
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251,7221,$$


In [16]:
df[df["total_bill"].apply(yelp).isin(["$"])].head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,last_four,expensive
6,8.77,2.0,Male,No,Sun,Dinner,2,4.38,Kristopher Johnson,2223727524230344,Sun5985,344,$
30,9.55,1.45,Male,No,Sat,Dinner,2,4.78,Grant Hall,30196517521548,Sat4099,1548,$
43,9.68,1.32,Male,No,Sun,Dinner,2,4.84,Christopher Spears,4387671121369212,Sun3279,9212,$
53,9.94,1.56,Male,No,Sun,Dinner,2,4.97,Curtis Morgan,4628628020417301,Sun4561,7301,$
67,3.07,1.0,Female,Yes,Sat,Dinner,1,3.07,Tiffany Brock,4359488526995267,Sat3455,5267,$
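For fixed numeric ranges like the ones in yelp(), binning with pd.cut is another option worth knowing. This is only a sketch on made-up bill values; the bin edges mirror the thresholds used in yelp() above.

```python
import numpy as np
import pandas as pd

# Made-up bill amounts chosen to sit on and around the thresholds
bills = pd.Series([3.07, 9.99, 10.0, 29.99, 30.0, 50.81], name="total_bill")

# right=False makes every bin closed on the left and open on the right:
# [-inf, 10) -> "$", [10, 30) -> "$$", [30, inf) -> "$$$"
price_level = pd.cut(
    bills,
    bins=[-np.inf, 10, 30, np.inf],
    labels=["$", "$$", "$$$"],
    right=False,
)

print(price_level.astype(str).tolist())
```

The result is a categorical column, which also records the ordering of the three labels.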


### apply with lambda

In [17]:
lambda num : num*2

<function __main__.<lambda>(num)>

In [18]:
df["total_bill"].apply(lambda bill : bill*0.18)

0      3.0582
1      1.8612
2      3.7818
3      4.2624
4      4.4262
        ...  
239    5.2254
240    4.8924
241    4.0806
242    3.2076
243    3.3804
Name: total_bill, Length: 244, dtype: float64
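For plain arithmetic like this, a lambda is not strictly needed, because pandas applies arithmetic operators element by element. A minimal sketch on a few made-up bill values:

```python
import numpy as np
import pandas as pd

# Made-up sample of bill amounts
bills = pd.Series([16.99, 10.34, 21.01], name="total_bill")

# apply() calls the lambda once per element
tip_apply = bills.apply(lambda bill: bill * 0.18)

# Multiplying the Series directly computes the same values in one step
tip_direct = bills * 0.18

print(np.allclose(tip_apply, tip_direct))
```

apply() with a lambda remains useful when the logic cannot be written as simple column arithmetic.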

### apply that uses multiple columns

Note, there are several ways to do this:

https://stackoverflow.com/questions/19914937/applying-function-with-multiple-arguments-to-create-a-new-pandas-column

In [19]:
def quality(tot_bill, tip):
    if tip/tot_bill > 0.25 :
        return "Generous"
    else :
        return "Other"
    

In [20]:
df[['total_bill','tip']].apply(lambda df: quality(df['total_bill'],df['tip']),axis=1)

0      Other
1      Other
2      Other
3      Other
4      Other
       ...  
239    Other
240    Other
241    Other
242    Other
243    Other
Length: 244, dtype: object

In [21]:
df["Tip Quality"] = df[["total_bill","tip"]].apply(lambda df: quality(df["total_bill"],df["tip"]), axis=1)

In [22]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,last_four,expensive,Tip Quality
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959,3410,$$,Other
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608,9230,$$,Other
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458,1322,$$,Other
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260,5994,$$,Other
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251,7221,$$,Other


In [23]:
# using the np.vectorize() function

np.vectorize(quality)(df["total_bill"], df["tip"])

array(['Other', 'Other', 'Other', 'Other', 'Other', 'Other', 'Other',
       'Other', 'Other', 'Other', 'Other', 'Other', 'Other', 'Other',
       'Other', 'Other', 'Other', 'Other', 'Other', 'Other', 'Other',
       'Other', 'Other', 'Other', 'Other', 'Other', 'Other', 'Other',
       'Other', 'Other', 'Other', 'Other', 'Other', 'Other', 'Other',
       'Other', 'Other', 'Other', 'Other', 'Other', 'Other', 'Other',
       'Other', 'Other', 'Other', 'Other', 'Other', 'Other', 'Other',
       'Other', 'Other', 'Generous', 'Other', 'Other', 'Other', 'Other',
       'Other', 'Other', 'Other', 'Other', 'Other', 'Other', 'Other',
       'Other', 'Other', 'Other', 'Other', 'Generous', 'Other', 'Other',
       'Other', 'Other', 'Other', 'Other', 'Other', 'Other', 'Other',
       'Other', 'Other', 'Other', 'Other', 'Other', 'Other', 'Other',
       'Other', 'Other', 'Other', 'Other', 'Other', 'Other', 'Other',
       'Other', 'Other', 'Generous', 'Other', 'Other', 'Other', 'Other',
       'Oth

In [24]:
df["Quality"] = np.vectorize(quality)(df["total_bill"], df["tip"])

In [25]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,last_four,expensive,Tip Quality,Quality
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959,3410,$$,Other,Other
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608,9230,$$,Other,Other
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458,1322,$$,Other,Other
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260,5994,$$,Other,Other
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251,7221,$$,Other,Other


### So, which one is faster?

In [26]:
import timeit

# code snippet to be executed only once 
setup = '''
import numpy as np
import pandas as pd
df = pd.read_csv('tips.csv')
def quality(total_bill,tip):
    if tip/total_bill  > 0.25:
        return "Generous"
    else:
        return "Other"
'''

# code snippet whose execution time is to be measured 
stmt_one = ''' 
df['Tip Quality'] = df[['total_bill','tip']].apply(lambda df: quality(df['total_bill'],df['tip']),axis=1)
'''

stmt_two = '''
df['Tip Quality'] = np.vectorize(quality)(df['total_bill'], df['tip'])
'''
  

In [27]:
timeit.timeit(setup=setup, stmt=stmt_one, number=1000)

3.8014769000000115

In [28]:
timeit.timeit(setup=setup, stmt=stmt_two, number=1000)   # the np.vectorize version runs faster here

0.21112919999995938

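Wow! The np.vectorize() version is much faster here (about 0.21 s versus 3.8 s for 1000 runs). Note that np.vectorize() is essentially a Python loop under the hood and is provided mainly for convenience; the gain comes largely from avoiding the per-row overhead of apply(..., axis=1). Keep np.vectorize() in mind for the future.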

Full Details: https://docs.scipy.org/doc/numpy/reference/generated/numpy.vectorize.html
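For a simple two-way condition such as quality(), an expression built on np.where is another way to write it, operating on whole columns without calling a Python function per row. The sketch below uses a tiny made-up DataFrame rather than tips.csv.

```python
import numpy as np
import pandas as pd

# Made-up rows: one ordinary tip and two generous ones
tips = pd.DataFrame({"total_bill": [16.99, 3.07, 7.25], "tip": [1.01, 1.00, 5.15]})

# Tip as a fraction of the bill, computed for the whole column at once
ratio = tips["tip"] / tips["total_bill"]

# np.where picks "Generous" where the condition holds and "Other" elsewhere
tips["Tip Quality"] = np.where(ratio > 0.25, "Generous", "Other")

print(tips["Tip Quality"].tolist())
```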

## Statistical Information and Sorting

## describe() for statistical summaries

In [29]:
df.describe()

Unnamed: 0,total_bill,tip,size,price_per_person,CC Number
count,244.0,244.0,244.0,244.0,244.0
mean,19.785943,2.998279,2.569672,7.888197,2563496000000000.0
std,8.902412,1.383638,0.9511,2.914234,2369340000000000.0
min,3.07,1.0,1.0,2.88,60406790000.0
25%,13.3475,2.0,2.0,5.8,30407310000000.0
50%,17.795,2.9,2.0,7.255,3525318000000000.0
75%,24.1275,3.5625,3.0,9.39,4553675000000000.0
max,50.81,10.0,6.0,20.27,6596454000000000.0


In [30]:
df.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
total_bill,244.0,19.78594,8.902412,3.07,13.3475,17.795,24.1275,50.81
tip,244.0,2.998279,1.383638,1.0,2.0,2.9,3.5625,10.0
size,244.0,2.569672,0.9510998,1.0,2.0,2.0,3.0,6.0
price_per_person,244.0,7.888197,2.914234,2.88,5.8,7.255,9.39,20.27
CC Number,244.0,2563496000000000.0,2369340000000000.0,60406790000.0,30407310000000.0,3525318000000000.0,4553675000000000.0,6596454000000000.0


## sort_values()

In [31]:
# To sort by a single column, passing the column name as a string is enough.

df.sort_values("tip").head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,last_four,expensive,Tip Quality,Quality
67,3.07,1.0,Female,Yes,Sat,Dinner,1,3.07,Tiffany Brock,4359488526995267,Sat3455,5267,$,Generous,Generous
236,12.6,1.0,Male,Yes,Sat,Dinner,2,6.3,Matthew Myers,3543676378973965,Sat5032,3965,$$,Other,Other
92,5.75,1.0,Female,Yes,Fri,Dinner,2,2.88,Leah Ramirez,3508911676966392,Fri3780,6392,$,Other,Other
111,7.25,1.0,Female,No,Sat,Dinner,1,7.25,Terri Jones,3559221007826887,Sat4801,6887,$,Other,Other
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959,3410,$$,Other,Other


In [32]:
# To sort by more than one column, pass the column names as a list (in square brackets).
# Sorting follows the order of the list:
# first by the first column name, then by the second.
# ascending=False sorts in descending order; ascending=True sorts in ascending order.

In [33]:
df.sort_values(["tip", "size"]).head()  # ascending defaults to True, so the rows are sorted in ascending order.

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,last_four,expensive,Tip Quality,Quality
67,3.07,1.0,Female,Yes,Sat,Dinner,1,3.07,Tiffany Brock,4359488526995267,Sat3455,5267,$,Generous,Generous
111,7.25,1.0,Female,No,Sat,Dinner,1,7.25,Terri Jones,3559221007826887,Sat4801,6887,$,Other,Other
92,5.75,1.0,Female,Yes,Fri,Dinner,2,2.88,Leah Ramirez,3508911676966392,Fri3780,6392,$,Other,Other
236,12.6,1.0,Male,Yes,Sat,Dinner,2,6.3,Matthew Myers,3543676378973965,Sat5032,3965,$$,Other,Other
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959,3410,$$,Other,Other


In [34]:
df.sort_values(by=["tip", "size"], ascending=[False, True]).head()  

# tip values are sorted in descending order and size values in ascending order.

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,last_four,expensive,Tip Quality,Quality
170,50.81,10.0,Male,Yes,Sat,Dinner,3,16.94,Gregory Clark,5473850968388236,Sat1954,8236,$$$,Other,Other
212,48.33,9.0,Male,No,Sat,Dinner,4,12.08,Alex Williamson,676218815212,Sat4590,5212,$$$,Other,Other
23,39.42,7.58,Male,No,Sat,Dinner,4,9.86,Lance Peterson,3542584061609808,Sat239,9808,$$$,Other,Other
59,48.27,6.73,Male,No,Sat,Dinner,4,12.07,Brian Ortiz,6596453823950595,Sat8139,595,$$$,Other,Other
141,34.3,6.7,Male,No,Thur,Lunch,6,5.72,Steven Carlson,3526515703718508,Thur1025,8508,$$$,Other,Other


In [35]:
df.sort_index(axis=1, ascending=True).head()

# axis=1 sorts by the column labels (column names).
# ascending=True sorts in ascending order, False in descending order.

Unnamed: 0,CC Number,Payer Name,Payment ID,Quality,Tip Quality,day,expensive,last_four,price_per_person,sex,size,smoker,time,tip,total_bill
0,3560325168603410,Christy Cunningham,Sun2959,Other,Other,Sun,$$,3410,8.49,Female,2,No,Dinner,1.01,16.99
1,4478071379779230,Douglas Tucker,Sun4608,Other,Other,Sun,$$,9230,3.45,Male,3,No,Dinner,1.66,10.34
2,6011812112971322,Travis Walters,Sun4458,Other,Other,Sun,$$,1322,7.0,Male,3,No,Dinner,3.5,21.01
3,4676137647685994,Nathaniel Harris,Sun5260,Other,Other,Sun,$$,5994,11.84,Male,2,No,Dinner,3.31,23.68
4,4832732618637221,Tonya Carter,Sun2251,Other,Other,Sun,$$,7221,6.15,Female,4,No,Dinner,3.61,24.59


In [36]:
df.sort_index(axis=0, ascending=False).head()

# axis=0 sorts by the index labels/numbers.
# ascending=True sorts in ascending order, False in descending order.

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,last_four,expensive,Tip Quality,Quality
243,18.78,3.0,Female,No,Thur,Dinner,2,9.39,Michelle Hardin,3511451626698139,Thur672,8139,$$,Other,Other
242,17.82,1.75,Male,No,Sat,Dinner,2,8.91,Dennis Dixon,4375220550950,Sat17,950,$$,Other,Other
241,22.67,2.0,Male,Yes,Sat,Dinner,2,11.34,Keith Wong,6011891618747196,Sat3880,7196,$$,Other,Other
240,27.18,2.0,Female,Yes,Sat,Dinner,2,13.59,Monica Sanders,3506806155565404,Sat1766,5404,$$,Other,Other
239,29.03,5.92,Male,No,Sat,Dinner,3,9.68,Michael Avila,5296068606052842,Sat2657,2842,$$,Other,Other


## df.corr() for correlation checks

In [37]:
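# Computes the pairwise correlation coefficients between the columns of the DataFrame.
# Note: on pandas 2.0 and later, non-numeric columns have to be excluded, e.g. df.corr(numeric_only=True).
df.corr()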

Unnamed: 0,total_bill,tip,size,price_per_person,CC Number
total_bill,1.0,0.675734,0.598315,0.647554,0.104576
tip,0.675734,1.0,0.489299,0.347405,0.110857
size,0.598315,0.489299,1.0,-0.175359,-0.030239
price_per_person,0.647554,0.347405,-0.175359,1.0,0.13524
CC Number,0.104576,0.110857,-0.030239,0.13524,1.0


In [38]:
# To get the correlation coefficients between specific variables only, select those columns first, as below.

df[["total_bill", "tip"]].corr()

Unnamed: 0,total_bill,tip
total_bill,1.0,0.675734
tip,0.675734,1.0
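When a DataFrame mixes numeric and text columns, one way to make the correlation call independent of the pandas version is to select the numeric columns explicitly first. This is a sketch on a small made-up frame, not the tips data.

```python
import pandas as pd

# Made-up frame with two numeric columns and one text column
tips = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip": [1.0, 2.0, 3.0, 4.0],
    "day": ["Sun", "Sat", "Sun", "Fri"],
})

# Keep only numeric columns, then compute the pairwise correlations
numeric_corr = tips.select_dtypes(include="number").corr()

print(numeric_corr)
```

In this toy data the tip is always 10% of the bill, so the two columns are perfectly correlated.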


## idxmax and idxmin

In [39]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,last_four,expensive,Tip Quality,Quality
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959,3410,$$,Other,Other
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608,9230,$$,Other,Other
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458,1322,$$,Other,Other
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260,5994,$$,Other,Other
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251,7221,$$,Other,Other


In [40]:
df["total_bill"].max()  # the maximum value in the total_bill column

50.81

In [41]:
df["total_bill"].idxmax()  # the index label of the maximum value in the total_bill column

170

In [42]:
df.iloc[170]  # to see the full row that holds the maximum total_bill

total_bill                     50.81
tip                             10.0
sex                             Male
smoker                           Yes
day                              Sat
time                          Dinner
size                               3
price_per_person               16.94
Payer Name             Gregory Clark
CC Number           5473850968388236
Payment ID                   Sat1954
last_four                       8236
expensive                        $$$
Tip Quality                    Other
Quality                        Other
Name: 170, dtype: object

In [43]:
df["total_bill"].min()  # the minimum value in the total_bill column

3.07

In [44]:
df["total_bill"].idxmin()  # the index label of the minimum value in the total_bill column

67

In [45]:
df.iloc[67]  # to see the full row that holds the minimum total_bill

total_bill                      3.07
tip                              1.0
sex                           Female
smoker                           Yes
day                              Sat
time                          Dinner
size                               1
price_per_person                3.07
Payer Name             Tiffany Brock
CC Number           4359488526995267
Payment ID                   Sat3455
last_four                       5267
expensive                          $
Tip Quality                 Generous
Quality                     Generous
Name: 67, dtype: object
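idxmax() and idxmin() return index labels rather than positions. With the default RangeIndex used here the two coincide, which is why iloc works above; with any other index, loc is the matching lookup. A small sketch with a made-up, non-default index:

```python
import pandas as pd

# Made-up frame whose index labels (10, 20, 30) differ from the positions (0, 1, 2)
bills = pd.DataFrame({"total_bill": [16.99, 50.81, 3.07]}, index=[10, 20, 30])

# idxmax() returns the label of the row holding the maximum
top_label = bills["total_bill"].idxmax()

# loc looks rows up by label, so it pairs naturally with idxmax()/idxmin()
top_row = bills.loc[top_label]

print(top_label, top_row["total_bill"])
```

Here bills.iloc[top_label] would fail, because position 20 does not exist in a three-row frame.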

In [46]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,last_four,expensive,Tip Quality,Quality
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959,3410,$$,Other,Other
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608,9230,$$,Other,Other
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458,1322,$$,Other,Other
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260,5994,$$,Other,Other
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251,7221,$$,Other,Other


## Single selections using iloc and DataFrame
## Rows:
data.iloc[0] # first row of data frame (Aleshia Tomkiewicz) - Note a Series data type output.

data.iloc[1] # second row of data frame (Evan Zigomalas)

data.iloc[-1] # last row of data frame (Mi Richan)

## Columns:
data.iloc[:,0] # first column of data frame (first_name)

data.iloc[:,1] # second column of data frame (last_name)

data.iloc[:,-1] # last column of data frame (id)

## Multiple row and column selections using iloc and DataFrame
data.iloc[0:5] # first five rows of dataframe

data.iloc[:, 0:2] # first two columns of data frame with all rows

data.iloc[[0,3,6,24], [0,5,6]] # 1st, 4th, 7th, 25th row + 1st 6th 7th columns.

data.iloc[0:5, 5:8] # first 5 rows and 6th, 7th, 8th columns of data frame (county -> phone1).
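The snippets above refer to a DataFrame named `data` that this notebook does not define. To try the same selections, here is a sketch with a small made-up `data` frame; the names and column values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the `data` frame used in the snippets above
data = pd.DataFrame({
    "first_name": ["Ann", "Bob", "Cy", "Di"],
    "last_name": ["Lee", "Ray", "Fox", "Kim"],
    "id": [1, 2, 3, 4],
})

first_row = data.iloc[0]            # single row -> Series
last_col = data.iloc[:, -1]         # single column -> Series
block = data.iloc[0:2, 0:2]         # first two rows, first two columns -> DataFrame
picked = data.iloc[[0, 3], [0, 2]]  # 1st and 4th rows, 1st and 3rd columns

print(block)
print(picked)
```

Slices in iloc exclude the end position, so 0:2 selects positions 0 and 1.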

## Selecting rows with a boolean / conditional lookup

In [47]:
df.loc[(df['total_bill']>5) & (df['size']==1)]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,last_four,expensive,Tip Quality,Quality
82,10.07,1.83,Female,No,Thur,Lunch,1,10.07,Julie Moody,630413282843,Thur4909,2843,$$,Other,Other
111,7.25,1.0,Female,No,Sat,Dinner,1,7.25,Terri Jones,3559221007826887,Sat4801,6887,$,Other,Other
222,8.58,1.92,Male,Yes,Fri,Lunch,1,8.58,Jason Lawrence,3505302934650403,Fri6624,403,$,Other,Other


In [48]:
df.loc[(df['total_bill']>5) & (df['size']==1), ["price_per_person","CC Number"]]

Unnamed: 0,price_per_person,CC Number
82,10.07,630413282843
111,7.25,3559221007826887
222,8.58,3505302934650403


## value_counts

Nice method to quickly get a count per category. Only makes sense on categorical columns.

In [49]:
# Used to get the counts per category. It is most meaningful on categorical columns.

In [50]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,last_four,expensive,Tip Quality,Quality
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959,3410,$$,Other,Other
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608,9230,$$,Other,Other
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458,1322,$$,Other,Other
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260,5994,$$,Other,Other
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251,7221,$$,Other,Other


In [51]:
df["smoker"].value_counts()

No     151
Yes     93
Name: smoker, dtype: int64

In [52]:
df["size"].value_counts()

2    156
3     38
4     37
5      5
1      4
6      4
Name: size, dtype: int64

In [53]:
df["time"].value_counts()

Dinner    176
Lunch      68
Name: time, dtype: int64

In [54]:
df["sex"].value_counts()

Male      157
Female     87
Name: sex, dtype: int64
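value_counts() can also report shares instead of raw counts through its normalize parameter. A minimal sketch on a made-up column:

```python
import pandas as pd

# Made-up categorical column: three "No" and one "Yes"
smoker = pd.Series(["No", "No", "No", "Yes"], name="smoker")

# normalize=True divides each count by the total number of values
shares = smoker.value_counts(normalize=True)

print(shares)
```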

## replace

Quickly replace values with another one.

In [55]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,last_four,expensive,Tip Quality,Quality
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959,3410,$$,Other,Other
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608,9230,$$,Other,Other
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458,1322,$$,Other,Other
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260,5994,$$,Other,Other
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251,7221,$$,Other,Other


In [56]:
df["time"].replace("Dinner","DN")  # can also be written as df["time"].replace(to_replace="Dinner", value="DN")

0      DN
1      DN
2      DN
3      DN
4      DN
       ..
239    DN
240    DN
241    DN
242    DN
243    DN
Name: time, Length: 244, dtype: object

In [57]:
df["sex"].replace(["Female", "Male"],["F", "M"])

# To replace several values at once, pass the list of values to replace first,
# then the list of new values; the order of the two lists must match.

0      F
1      M
2      M
3      M
4      F
      ..
239    M
240    F
241    M
242    M
243    F
Name: sex, Length: 244, dtype: object
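replace() also accepts a dictionary, which keeps each old value next to its new value and avoids relying on list order. A short sketch on a made-up column:

```python
import pandas as pd

# Made-up stand-in for the "sex" column
sex = pd.Series(["Female", "Male", "Male"], name="sex")

# Keys are the values to replace, dictionary values are the replacements
short = sex.replace({"Female": "F", "Male": "M"})

print(short.tolist())
```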

## unique and nunique

In [58]:
df["size"].unique()  # returns the unique values in the column.

array([2, 3, 4, 1, 6, 5], dtype=int64)

In [59]:
df["size"].nunique()  # returns how many unique values the column contains.

6

In [60]:
df["time"].unique()

array(['Dinner', 'Lunch'], dtype=object)

## map

### Used to replace multiple values in a DataFrame with other values.

In [61]:
my_map = {"Dinner": "D", "Lunch": "L"}

In [62]:
df["time"] = df["time"].map(my_map)  # replaces "Dinner" and "Lunch" in the time column with "D" and "L"

In [63]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,last_four,expensive,Tip Quality,Quality
0,16.99,1.01,Female,No,Sun,D,2,8.49,Christy Cunningham,3560325168603410,Sun2959,3410,$$,Other,Other
1,10.34,1.66,Male,No,Sun,D,3,3.45,Douglas Tucker,4478071379779230,Sun4608,9230,$$,Other,Other
2,21.01,3.5,Male,No,Sun,D,3,7.0,Travis Walters,6011812112971322,Sun4458,1322,$$,Other,Other
3,23.68,3.31,Male,No,Sun,D,2,11.84,Nathaniel Harris,4676137647685994,Sun5260,5994,$$,Other,Other
4,24.59,3.61,Female,No,Sun,D,4,6.15,Tonya Carter,4832732618637221,Sun2251,7221,$$,Other,Other


In [64]:
df["time"].value_counts()

D    176
L     68
Name: time, dtype: int64
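One difference between map() and replace() is worth keeping in mind: with a dictionary, map() turns every value that is missing from the dictionary into NaN, while replace() leaves such values unchanged. A sketch with a made-up column that contains a value outside the mapping:

```python
import pandas as pd

# Made-up column; "Brunch" is deliberately absent from the mapping
time = pd.Series(["Dinner", "Lunch", "Brunch"], name="time")
mapping = {"Dinner": "D", "Lunch": "L"}

mapped = time.map(mapping)        # "Brunch" becomes NaN
replaced = time.replace(mapping)  # "Brunch" is kept as-is

print(mapped.tolist())
print(replaced.tolist())
```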

## Duplicates

### .duplicated() and .drop_duplicates()

In [65]:
# Returns True for each value that has already appeared earlier (the first occurrence is marked False).

In [66]:
df["size"].duplicated()

0      False
1      False
2       True
3       True
4      False
       ...  
239     True
240     True
241     True
242     True
243     True
Name: size, Length: 244, dtype: bool

In [67]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,last_four,expensive,Tip Quality,Quality
0,16.99,1.01,Female,No,Sun,D,2,8.49,Christy Cunningham,3560325168603410,Sun2959,3410,$$,Other,Other
1,10.34,1.66,Male,No,Sun,D,3,3.45,Douglas Tucker,4478071379779230,Sun4608,9230,$$,Other,Other
2,21.01,3.5,Male,No,Sun,D,3,7.0,Travis Walters,6011812112971322,Sun4458,1322,$$,Other,Other
3,23.68,3.31,Male,No,Sun,D,2,11.84,Nathaniel Harris,4676137647685994,Sun5260,5994,$$,Other,Other
4,24.59,3.61,Female,No,Sun,D,4,6.15,Tonya Carter,4832732618637221,Sun2251,7221,$$,Other,Other


In [68]:
df.tail()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,last_four,expensive,Tip Quality,Quality
239,29.03,5.92,Male,No,Sat,D,3,9.68,Michael Avila,5296068606052842,Sat2657,2842,$$,Other,Other
240,27.18,2.0,Female,Yes,Sat,D,2,13.59,Monica Sanders,3506806155565404,Sat1766,5404,$$,Other,Other
241,22.67,2.0,Male,Yes,Sat,D,2,11.34,Keith Wong,6011891618747196,Sat3880,7196,$$,Other,Other
242,17.82,1.75,Male,No,Sat,D,2,8.91,Dennis Dixon,4375220550950,Sat17,950,$$,Other,Other
243,18.78,3.0,Female,No,Thur,D,2,9.39,Michelle Hardin,3511451626698139,Thur672,8139,$$,Other,Other


In [69]:
simple_df = pd.DataFrame([1,2,2],['a','b','c'])
simple_df

Unnamed: 0,0
a,1
b,2
c,2


In [70]:
simple_df.duplicated()

a    False
b    False
c     True
dtype: bool

In [71]:
simple_df.drop_duplicates()  # removes the repeated rows from the DataFrame

Unnamed: 0,0
a,1
b,2
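Both duplicated() and drop_duplicates() take a keep parameter, and both can be limited to certain columns with subset. The sketch below shows these options on a small made-up frame.

```python
import pandas as pd

# Made-up frame: rows 0 and 1 are identical, all rows share the same size
df_small = pd.DataFrame({"day": ["Sun", "Sun", "Sat"], "size": [2, 2, 2]})

dup_default = df_small.duplicated()           # flags repeats after the first occurrence
dup_all = df_small.duplicated(keep=False)     # flags every row that has a duplicate
kept_last = df_small.drop_duplicates(keep="last")    # keeps the last of each duplicate group
by_size = df_small.drop_duplicates(subset="size")    # compares the "size" column only

print(dup_default.tolist(), dup_all.tolist())
print(kept_last.index.tolist(), by_size.index.tolist())
```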


## between

left: A scalar value that defines the left boundary

right: A scalar value that defines the right boundary

inclusive: Which boundaries to include: "both" (the default), "neither", "left", or "right". Older pandas versions (before 1.3) took a Boolean here, True by default; False excluded both boundaries.

In [72]:
df["total_bill"].between(10,20, inclusive="both")

0       True
1       True
2      False
3      False
4      False
       ...  
239    False
240    False
241    False
242     True
243     True
Name: total_bill, Length: 244, dtype: bool

In [73]:
# inclusive="both" keeps both boundaries (the Boolean form inclusive=True is deprecated since pandas 1.3)
df[df["total_bill"].between(10,20, inclusive="both")]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,last_four,expensive,Tip Quality,Quality
0,16.99,1.01,Female,No,Sun,D,2,8.49,Christy Cunningham,3560325168603410,Sun2959,3410,$$,Other,Other
1,10.34,1.66,Male,No,Sun,D,3,3.45,Douglas Tucker,4478071379779230,Sun4608,9230,$$,Other,Other
8,15.04,1.96,Male,No,Sun,D,2,7.52,Joseph Mcdonald,3522866365840377,Sun6820,0377,$$,Other,Other
9,14.78,3.23,Male,No,Sun,D,2,7.39,Jerome Abbott,3532124519049786,Sun3775,9786,$$,Other,Other
10,10.27,1.71,Male,No,Sun,D,2,5.14,William Riley,566287581219,Sun2546,1219,$$,Other,Other
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
234,15.53,3.00,Male,Yes,Sat,D,2,7.76,Tracy Douglas,4097938155941930,Sat7220,1930,$$,Other,Other
235,10.07,1.25,Male,No,Sat,D,2,5.04,Sean Gonzalez,3534021246117605,Sat4615,7605,$$,Other,Other
236,12.60,1.00,Male,Yes,Sat,D,2,6.30,Matthew Myers,3543676378973965,Sat5032,3965,$$,Other,Other
242,17.82,1.75,Male,No,Sat,D,2,8.91,Dennis Dixon,4375220550950,Sat17,0950,$$,Other,Other
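To see how the inclusive options differ, here is a sketch on three made-up values that sit exactly on and between the boundaries. It assumes pandas 1.3 or later, where inclusive takes string values.

```python
import pandas as pd

# Made-up values: the left boundary, a value in between, and the right boundary
bills = pd.Series([10.0, 15.0, 20.0], name="total_bill")

both = bills.between(10, 20, inclusive="both")        # 10 <= x <= 20
neither = bills.between(10, 20, inclusive="neither")  # 10 <  x <  20
left = bills.between(10, 20, inclusive="left")        # 10 <= x <  20
right = bills.between(10, 20, inclusive="right")      # 10 <  x <= 20

print(both.tolist(), neither.tolist(), left.tolist(), right.tolist())
```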


## sample

In [74]:
df.sample(5)  # returns a DataFrame made up of 5 randomly selected rows.

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,last_four,expensive,Tip Quality,Quality
171,15.81,3.16,Male,Yes,Sat,D,2,7.9,David Hall,502004138207,Sat6750,8207,$$,Other,Other
63,18.29,3.76,Male,Yes,Sat,D,4,4.57,Chad Hart,580171498976,Sat4178,8976,$$,Other,Other
199,13.51,2.0,Male,Yes,Thur,L,2,6.76,Joseph Murphy MD,6547218923471275,Thur2428,1275,$$,Other,Other
92,5.75,1.0,Female,Yes,Fri,D,2,2.88,Leah Ramirez,3508911676966392,Fri3780,6392,$,Other,Other
42,13.94,3.06,Male,No,Sun,D,2,6.97,Bryan Brown,36231182760859,Sun1699,859,$$,Other,Other


In [75]:
df.sample(frac=0.1)  # to see a random 10% of the rows in the data set

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,last_four,expensive,Tip Quality,Quality
14,14.83,3.02,Female,No,Sun,D,2,7.42,Vanessa Jones,30016702287574,Sun3848,7574,$$,Other,Other
187,30.46,2.0,Male,Yes,Sun,D,5,6.09,David Barrett,4792882899700988,Sun9987,988,$$$,Other,Other
140,17.47,3.5,Female,No,Thur,L,2,8.74,Kayla Rios,5233918213804470,Thur3906,4470,$$,Other,Other
24,19.82,3.18,Male,No,Sat,D,2,9.91,Christopher Ross,36739148167928,Sat6236,7928,$$,Other,Other
159,16.49,2.0,Male,No,Sun,D,4,4.12,Christopher Soto,30501814271434,Sun1781,1434,$$,Other,Other
194,16.58,4.0,Male,Yes,Thur,L,2,8.29,Benjamin Weber,676210011505,Thur9318,1505,$$,Other,Other
229,22.12,2.88,Female,Yes,Sat,D,2,11.06,Jennifer Russell,4793003293608,Sat3943,3608,$$,Other,Other
130,19.08,1.5,Male,No,Thur,L,2,9.54,Seth Sexton,213113680829581,Thur1446,9581,$$,Other,Other
118,12.43,1.8,Female,No,Thur,L,2,6.22,Dr. Caroline Tucker,502047186908,Thur8084,6908,$$,Other,Other
39,31.27,5.0,Male,No,Sat,D,3,10.42,Mr. Brandon Berry,6011525851069856,Sat6373,9856,$$$,Other,Other
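Because sample() is random, the rows change on every run. Passing random_state makes the draw repeatable, which can help when sharing results. A sketch on a made-up ten-row frame:

```python
import pandas as pd

# Made-up frame with ten rows
df_small = pd.DataFrame({"total_bill": range(10)})

# The same random_state yields the same rows every time
first_draw = df_small.sample(3, random_state=42)
second_draw = df_small.sample(3, random_state=42)

# frac=0.5 returns half of the rows (5 of 10 here)
half = df_small.sample(frac=0.5, random_state=0)

print(first_draw.equals(second_draw), len(half))
```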


## nlargest and nsmallest

In [76]:
df.nlargest(10, "tip")  # returns the 10 rows with the largest values in the tip column.

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,last_four,expensive,Tip Quality,Quality
170,50.81,10.0,Male,Yes,Sat,D,3,16.94,Gregory Clark,5473850968388236,Sat1954,8236,$$$,Other,Other
212,48.33,9.0,Male,No,Sat,D,4,12.08,Alex Williamson,676218815212,Sat4590,5212,$$$,Other,Other
23,39.42,7.58,Male,No,Sat,D,4,9.86,Lance Peterson,3542584061609808,Sat239,9808,$$$,Other,Other
59,48.27,6.73,Male,No,Sat,D,4,12.07,Brian Ortiz,6596453823950595,Sat8139,595,$$$,Other,Other
141,34.3,6.7,Male,No,Thur,L,6,5.72,Steven Carlson,3526515703718508,Thur1025,8508,$$$,Other,Other
183,23.17,6.5,Male,Yes,Sun,D,4,5.79,Dr. Michael James,4718501859162,Sun6059,9162,$$,Generous,Generous
214,28.17,6.5,Female,Yes,Sat,D,3,9.39,Marissa Jackson,4922302538691962,Sat3374,1962,$$,Other,Other
47,32.4,6.0,Male,No,Sun,D,4,8.1,James Barnes,3552002592874186,Sun9677,4186,$$$,Other,Other
239,29.03,5.92,Male,No,Sat,D,3,9.68,Michael Avila,5296068606052842,Sat2657,2842,$$,Other,Other
88,24.71,5.85,Male,No,Thur,L,2,12.36,Roger Taylor,4410248629955,Thur9003,9955,$$,Other,Other


In [77]:
df.nsmallest(5, "price_per_person")

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,last_four,expensive,Tip Quality,Quality
92,5.75,1.0,Female,Yes,Fri,D,2,2.88,Leah Ramirez,3508911676966392,Fri3780,6392,$,Other,Other
67,3.07,1.0,Female,Yes,Sat,D,1,3.07,Tiffany Brock,4359488526995267,Sat3455,5267,$,Generous,Generous
16,10.33,1.67,Female,No,Sun,D,3,3.44,Elizabeth Foster,4240025044626033,Sun9715,6033,$$,Other,Other
1,10.34,1.66,Male,No,Sun,D,3,3.45,Douglas Tucker,4478071379779230,Sun4608,9230,$$,Other,Other
172,7.25,5.15,Male,Yes,Sun,D,2,3.62,Larry White,30432617123103,Sun9209,3103,$,Generous,Generous
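nlargest() and nsmallest() can be read as shortcuts for sorting and then taking the first rows. The sketch below compares the two approaches on made-up tip values without ties (with ties, the order of tied rows may differ between the two).

```python
import pandas as pd

# Made-up tips, all distinct so the row order is unambiguous
tips = pd.DataFrame({"tip": [1.01, 10.0, 3.5, 6.7]})

# Two largest tips via nlargest()
top2 = tips.nlargest(2, "tip")

# The same rows via an explicit descending sort followed by head()
top2_sorted = tips.sort_values("tip", ascending=False).head(2)

# Smallest tip via nsmallest()
lowest = tips.nsmallest(1, "tip")

print(top2.index.tolist(), lowest.index.tolist())
```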
