# Indexing DataFrames

Import pandas and read in a small DataFrame (DF) of sales data.

In [1]:
import pandas as pd

filepath = 'csv_files/SalesJan2009.csv'

df = pd.read_csv(filepath_or_buffer=filepath, parse_dates=['Transaction_date'],
                 usecols=['Transaction_date', 'Product', 'Price', 'Payment_Type'])
df.head()

Unnamed: 0,Transaction_date,Product,Price,Payment_Type
0,2009-01-02 06:17:00,Product1,1200,Mastercard
1,2009-01-02 04:53:00,Product1,1200,Visa
2,2009-01-02 13:08:00,Product1,1200,Mastercard
3,2009-01-03 14:44:00,Product1,1200,Visa
4,2009-01-04 12:56:00,Product2,3600,Visa


### Indexing using square brackets

In [2]:
df['Product'][2]      # column label, then row label
df.Product[2]         # attribute access, equivalent to the previous line
df.loc[2, 'Product']  # row label, then column label
df.iloc[2, 1]         # row position, then column position

'Product1'
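
The four expressions above agree because the default integer index makes row label 2 and row position 2 coincide. A minimal sketch, using a made-up `toy` frame whose index labels are chosen for illustration, shows where label-based and position-based access differ:

```python
import pandas as pd

# Hypothetical toy data; the index labels deliberately differ from the positions
toy = pd.DataFrame({'Product': ['Product1', 'Product2', 'Product3'],
                    'Price': [1200, 3600, 7500]},
                   index=[10, 20, 30])

by_label = toy.loc[20, 'Product']      # row label 20, column label 'Product'
by_position = toy.iloc[1, 0]           # second row, first column
by_column_first = toy['Product'][20]   # column label first, then row label

print(by_label, by_position, by_column_first)
```

Here `.loc[1, 'Product']` would raise a `KeyError`, because no row carries the label 1.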

### Creating new DF from previous

In [3]:
df_new = df[['Price', 'Transaction_date']]
df_new.head()

Unnamed: 0,Price,Transaction_date
0,1200,2009-01-02 06:17:00
1,1200,2009-01-02 04:53:00
2,1200,2009-01-02 13:08:00
3,1200,2009-01-03 14:44:00
4,3600,2009-01-04 12:56:00


## Slicing DataFrames

In [4]:
df['Price'].head(3)

0    1200
1    1200
2    1200
Name: Price, dtype: object

Selecting a single column from a DF returns a pandas Series.

In [5]:
type(df['Price'])

pandas.core.series.Series

A Series is a one-dimensional array with a labelled index, a hybrid between a NumPy array and a dictionary.
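
A short sketch of both sides of that hybrid, assuming a made-up `prices` Series (the labels and values are illustrative only):

```python
import pandas as pd

# Hypothetical prices keyed by product name
prices = pd.Series({'Product1': 1200, 'Product2': 3600, 'Product3': 7500})

# Dictionary-like: look up a value by its label, test whether a label exists
p2 = prices['Product2']
has_p3 = 'Product3' in prices

# Array-like: arithmetic and reductions work on all elements at once
doubled = prices * 2
total = prices.sum()

print(p2, has_p3, total)
```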

### Slicing and indexing a Series

In [6]:
df['Price'][1:4]

1    1200
2    1200
3    1200
Name: Price, dtype: object

### Using .loc[]

The first argument in the .loc[] brackets selects the row label(s) and the second selects the column label(s). Unlike list slicing, a label slice includes its end label.

In [7]:
df.loc[:, 'Transaction_date': 'Price'].head(3) # all rows; columns from 'Transaction_date' to 'Price',
                                               # end label included; row label slices
                                               # include their end label too

Unnamed: 0,Transaction_date,Product,Price
0,2009-01-02 06:17:00,Product1,1200
1,2009-01-02 04:53:00,Product1,1200
2,2009-01-02 13:08:00,Product1,1200


.loc[] also accepts lists, for example:

In [8]:
df.loc[:3, ['Transaction_date', 'Price']]

Unnamed: 0,Transaction_date,Price
0,2009-01-02 06:17:00,1200
1,2009-01-02 04:53:00,1200
2,2009-01-02 13:08:00,1200
3,2009-01-03 14:44:00,1200
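
The inclusive end label is easiest to see next to `.iloc[]`, which follows the usual Python rule of excluding the end position. A minimal sketch on a made-up `toy` frame:

```python
import pandas as pd

# Hypothetical toy data with the default integer index
toy = pd.DataFrame({'a': [1, 2, 3, 4],
                    'b': [5, 6, 7, 8],
                    'c': [9, 10, 11, 12]})

label_rows = toy.loc[0:2]         # labels 0, 1 and 2: end label included
position_rows = toy.iloc[0:2]     # positions 0 and 1: end position excluded
label_cols = toy.loc[:, 'a':'b']  # columns 'a' and 'b': end label included

print(len(label_rows), len(position_rows), list(label_cols.columns))
```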


## Series vs 1-column DF

In [9]:
df['Price'].head(3), type(df['Price'])

(0    1200
 1    1200
 2    1200
 Name: Price, dtype: object, pandas.core.series.Series)

In [10]:
df[['Price']].head(3), type(df[['Price']])

(  Price
 0  1200
 1  1200
 2  1200, pandas.core.frame.DataFrame)
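
The difference also shows up in the shape of the result. A small sketch, assuming a made-up `toy` frame:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'Price': [1200, 1200, 3600],
                    'Product': ['Product1', 'Product1', 'Product2']})

as_series = toy['Price']    # single brackets with one label
as_frame = toy[['Price']]   # double brackets: a list containing one label

print(as_series.shape, as_series.ndim)  # one-dimensional
print(as_frame.shape, as_frame.ndim)    # two-dimensional with a single column
```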

## Slicing in reverse order

In [11]:
df.loc[:5, 'Payment_Type'].head(6) # usual order

0    Mastercard
1          Visa
2    Mastercard
3          Visa
4          Visa
5          Visa
Name: Payment_Type, dtype: object

In [12]:
df.loc[5:0:-1, 'Payment_Type'].head(6) # reverse order

5          Visa
4          Visa
3          Visa
2    Mastercard
1          Visa
0    Mastercard
Name: Payment_Type, dtype: object
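
Reverse slices follow the same inclusive and exclusive rules as forward slices. A minimal sketch on a made-up `letters` Series:

```python
import pandas as pd

# Hypothetical toy data with the default integer index
letters = pd.Series(['a', 'b', 'c', 'd'])

rev_loc = letters.loc[2:0:-1]    # labels 2, 1 and 0: end label included
rev_iloc = letters.iloc[2:0:-1]  # positions 2 and 1: end position excluded
rev_all = letters.iloc[::-1]     # the whole Series in reverse order

print(list(rev_loc.index), list(rev_iloc.index), list(rev_all))
```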

## Filtering DataFrames

Filter the rows of a DF by the values they contain rather than by labels or positions.

In [2]:
import pandas as pd
import numpy as np
file = 'csv_files/world_population.csv'

df_world = pd.read_csv(filepath_or_buffer=file, skiprows=4)

df_world.columns

Index(['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code',
       '1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968',
       '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977',
       '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986',
       '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995',
       '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004',
       '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
       '2014', '2015', '2016', '2017', 'Unnamed: 62'],
      dtype='object')

In [14]:
df_world_small = df_world[['Country Name', 'Country Code', '2010', '2015']]

df_world_small.head(4)

Unnamed: 0,Country Name,Country Code,2010,2015
0,Aruba,ABW,101669.0,104341.0
1,Afghanistan,AFG,28803167.0,33736494.0
2,Angola,AGO,23369131.0,27859305.0
3,Albania,ALB,2913021.0,2880703.0


### Filtering with Boolean Series

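Select the rows whose 2010 population is larger than 1 billion. The dataset also contains regional and income-group aggregates, so they appear in the result alongside countries.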

In [15]:
df_world_small[df_world_small['2010'] > 1e9].head()

Unnamed: 0,Country Name,Country Code,2010,2015
38,China,CHN,1337705000.0,1371220000.0
59,East Asia & Pacific (excluding high income),EAP,1966232000.0,2038411000.0
60,Early-demographic dividend,EAR,2909411000.0,3127579000.0
61,East Asia & Pacific,EAS,2207155000.0,2283108000.0
93,High income,HIC,1149514000.0,1182932000.0
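
The comparison inside the brackets is itself a Series of True/False values, and indexing with it keeps the rows marked True. A minimal sketch on a made-up three-row `toy` frame (values copied from the outputs above):

```python
import pandas as pd

# Hypothetical subset of the population data
toy = pd.DataFrame({'Country Name': ['Aruba', 'China', 'Albania'],
                    '2010': [101669.0, 1337705000.0, 2913021.0]})

mask = toy['2010'] > 1e9   # Boolean Series aligned with the rows of toy
big = toy[mask]            # keeps only the rows where mask is True

print(mask.tolist())
print(big)
```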


### Combining filters

In [16]:
df_world_small[(df_world_small['2010'] > 1e6) & (df_world_small['2010'] < 1.5e6)] # both conditions

Unnamed: 0,Country Name,Country Code,2010,2015
20,Bahrain,BHR,1240862.0,1371855.0
51,Cyprus,CYP,1112607.0,1160985.0
69,Estonia,EST,1331475.0,1315407.0
165,Mauritius,MUS,1250400.0,1262605.0
222,Swaziland,SWZ,1202843.0,1319011.0
235,Timor-Leste,TLS,1109591.0,1240977.0
240,Trinidad and Tobago,TTO,1328100.0,1360092.0


In [17]:
df_world_small[(df_world_small['2010'] > 1e6) | (df_world_small['2015'] < 1.5e6)].head() # either condition

Unnamed: 0,Country Name,Country Code,2010,2015
0,Aruba,ABW,101669.0,104341.0
1,Afghanistan,AFG,28803167.0,33736494.0
2,Angola,AGO,23369131.0,27859305.0
3,Albania,ALB,2913021.0,2880703.0
4,Andorra,AND,84449.0,78014.0
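
Each condition needs its own parentheses because `&` and `|` bind more tightly than the comparison operators. The sketch below, on a made-up `toy` frame, also shows negation with `~` and `Series.between` as an alternative for range checks:

```python
import pandas as pd

# Hypothetical subset of the population data
toy = pd.DataFrame({'Country Name': ['Aruba', 'Bahrain', 'Cyprus', 'Albania'],
                    '2010': [101669.0, 1240862.0, 1112607.0, 2913021.0]})
pop = toy['2010']

both = toy[(pop > 1e6) & (pop < 1.5e6)]      # both conditions hold
outside = toy[~((pop > 1e6) & (pop < 1.5e6))]  # negation of the combined filter
same_as_both = toy[pop.between(1e6, 1.5e6, inclusive='neither')]

print(both['Country Name'].tolist())
print(outside['Country Name'].tolist())
```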


### DFs with 0-s and NaN-s

Let's create a new DF that also contains 0 and NaN values.

In [18]:
df_world_2 = df_world_small.head().copy()

df_world_2['2019'] = [0, 0, np.nan, 3e7, 80000]

df_world_2

Unnamed: 0,Country Name,Country Code,2010,2015,2019
0,Aruba,ABW,101669.0,104341.0,0.0
1,Afghanistan,AFG,28803167.0,33736494.0,0.0
2,Angola,AGO,23369131.0,27859305.0,
3,Albania,ALB,2913021.0,2880703.0,30000000.0
4,Andorra,AND,84449.0,78014.0,80000.0


Select the columns in which every value is NONZERO; all other columns are dropped.

In [19]:
df_world_2.loc[:, df_world_2.all()]

Unnamed: 0,Country Name,Country Code,2010,2015
0,Aruba,ABW,101669.0,104341.0
1,Afghanistan,AFG,28803167.0,33736494.0
2,Angola,AGO,23369131.0,27859305.0
3,Albania,ALB,2913021.0,2880703.0
4,Andorra,AND,84449.0,78014.0


Select the columns with at least one nonzero value; this drops any column whose values are all 0.

In [20]:
df_world_2.loc[:, df_world_2.any()]

Unnamed: 0,Country Name,Country Code,2010,2015,2019
0,Aruba,ABW,101669.0,104341.0,0.0
1,Afghanistan,AFG,28803167.0,33736494.0,0.0
2,Angola,AGO,23369131.0,27859305.0,
3,Albania,ALB,2913021.0,2880703.0,30000000.0
4,Andorra,AND,84449.0,78014.0,80000.0


Since no column consists entirely of zeros, the same DF is printed.
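
Both `.all()` and `.any()` return one Boolean per column, which `.loc[]` then uses as a column mask. By default they skip NaN values. A minimal sketch on a made-up numeric `toy` frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'b' is all zeros, 'c' mixes 0, NaN and a nonzero value,
# 'd' has a NaN but no zeros
toy = pd.DataFrame({'a': [1, 2, 3],
                    'b': [0, 0, 0],
                    'c': [0, np.nan, 5],
                    'd': [1, np.nan, 2]})

all_mask = toy.all()   # True where every non-NaN value is nonzero
any_mask = toy.any()   # True where at least one non-NaN value is nonzero

all_nonzero = toy.loc[:, all_mask]
some_nonzero = toy.loc[:, any_mask]

print(list(all_nonzero.columns), list(some_nonzero.columns))
```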

Select columns with any NaN values

In [21]:
df_world_2.loc[:, df_world_2.isnull().any()]

Unnamed: 0,2019
0,0.0
1,0.0
2,
3,30000000.0
4,80000.0


Keep only the COLUMNS that have no NaN values

In [22]:
df_world_2.loc[:, df_world_2.notnull().all()]

Unnamed: 0,Country Name,Country Code,2010,2015
0,Aruba,ABW,101669.0,104341.0
1,Afghanistan,AFG,28803167.0,33736494.0
2,Angola,AGO,23369131.0,27859305.0
3,Albania,ALB,2913021.0,2880703.0
4,Andorra,AND,84449.0,78014.0


Drop ROWS with any NaN values

In [23]:
df_world_2.dropna(how='any')

Unnamed: 0,Country Name,Country Code,2010,2015,2019
0,Aruba,ABW,101669.0,104341.0,0.0
1,Afghanistan,AFG,28803167.0,33736494.0,0.0
3,Albania,ALB,2913021.0,2880703.0,30000000.0
4,Andorra,AND,84449.0,78014.0,80000.0
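
`dropna` has a few other modes that are worth comparing with `how='any'`. A minimal sketch on a made-up `toy` frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 is entirely NaN, rows 0 and 3 are partly NaN
toy = pd.DataFrame({'x': [1.0, np.nan, 3.0, np.nan],
                    'y': [np.nan, np.nan, 6.0, 5.0],
                    'z': [7.0, np.nan, 9.0, 2.0]})

rows_any = toy.dropna(how='any')     # drop rows with at least one NaN
rows_all = toy.dropna(how='all')     # drop only rows that are entirely NaN
rows_x = toy.dropna(subset=['x'])    # look for NaN in column 'x' only

print(list(rows_any.index), list(rows_all.index), list(rows_x.index))
```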


## Transforming DFs

First, load a convenient DF to work with.

In [3]:
file_path = 'csv_files/kontovv.csv'

df_kontovv = pd.read_csv(filepath_or_buffer=file_path, sep=';', index_col='Kuupäev', 
                         parse_dates=True, usecols=['Kuupäev', 'Summa', 'Selgitus', 'Teenustasu'],
                         decimal=',')
# the decimal= parameter names the decimal separator used in the file, so that '32,7' is parsed as 32.7

df_kontovv.head()

Kuupäev,Summa,Selgitus,Teenustasu
2018-01-02,32.7,31/01/2018 18:15 kaart...590614 Retro kohvik ...,0
2018-01-02,7.2,30/01/2018 13:57 kaart...590614 Olerex Sopruse...,0
2018-01-02,7.0,Autosõit,0
2018-01-02,6.0,Pizza,0
2018-01-02,5.68,29/01/2018 22:03 kaart...590614 Apollo Kino Mu...,0
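
The effect of `sep=` and `decimal=` can be checked without the original file. The sketch below reads a made-up two-row text (`raw` is an assumption, not the contents of kontovv.csv) with and without `decimal=','`:

```python
import io

import pandas as pd

# Hypothetical file contents: semicolon-separated, comma as decimal separator
raw = "Kuupäev;Summa;Selgitus\n2018-01-02;32,70;Pizza\n2018-01-03;5,68;Autosõit\n"

parsed = pd.read_csv(io.StringIO(raw), sep=';', decimal=',')
unparsed = pd.read_csv(io.StringIO(raw), sep=';')  # 'Summa' stays text

print(parsed['Summa'].tolist())
print(unparsed['Summa'].tolist())
```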


In [34]:
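# Card payments carry '590614' at positions 25-30; for those rows keep only the merchant
# name (from position 32, without the last 8 characters) and leave other rows unchanged
df_kontovv['Selgitus'] = df_kontovv['Selgitus'].map(lambda x: x[32:-8] if x[25:31] == '590614' else x)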

df_kontovv.head()

Kuupäev,Summa,Selgitus,Teenustasu
2018-01-02,32.7,Retro kohvik,0
2018-01-02,7.2,Olerex Sopruse pst,0
2018-01-02,7.0,Autosõit,0
2018-01-02,6.0,Pizza,0
2018-01-02,5.68,Apollo Kino Mustamae,0
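
The same clean-up can probably be written with the `.str` accessor instead of a Python lambda. This is only a sketch of an alternative, not the method used above; the `notes` Series and its strings are made up to match the layout described (card number at positions 25-30, an 8-character tail):

```python
import pandas as pd

# Hypothetical descriptions: one card payment and one manual entry
notes = pd.Series(['31/01/2018 18:15 kaart...590614 Retro kohvik Tallinn',
                   'Pizza'])

is_card = notes.str[25:31] == '590614'   # Boolean mask of card-payment rows
# Take the sliced text where the mask is True, the original text elsewhere
cleaned = notes.str[32:-8].where(is_card, notes)

print(cleaned.tolist())
```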


### DataFrame vectorized methods

Pandas built-in methods are vectorized and usually the most efficient option.

In [39]:
df_kontovv.Summa.floordiv(5).head(3) # convert to units of 5, rounded down

Kuupäev
2018-01-02    6.0
2018-01-02    1.0
2018-01-02    1.0
Name: Summa, dtype: float64

NumPy method

In [44]:
import numpy as np

np.floor_divide(df_kontovv.Summa, 5).head(3) # convert to units of 5, rounded down

Kuupäev
2018-01-02    6.0
2018-01-02    1.0
2018-01-02    1.0
Name: Summa, dtype: float64

Plain Python functions

In [45]:
def fives(n):
    return n//5 # convert to units of 5, rounded down

df_kontovv.Summa.apply(fives).head(3)

Kuupäev
2018-01-02    6.0
2018-01-02    1.0
2018-01-02    1.0
Name: Summa, dtype: float64

Lambda function method

In [46]:
df_kontovv.Summa.apply(lambda x: x//5).head(3)

Kuupäev
2018-01-02    6.0
2018-01-02    1.0
2018-01-02    1.0
Name: Summa, dtype: float64
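
All four approaches give the same result and differ only in how the work is done. A minimal sketch that checks this on a made-up `summa` Series (values copied from the outputs above):

```python
import numpy as np
import pandas as pd

# Hypothetical amounts
summa = pd.Series([32.7, 7.2, 7.0], name='Summa')

def fives(n):
    return n // 5  # units of 5, rounded down

vectorized = summa.floordiv(5)               # pandas method
with_numpy = np.floor_divide(summa, 5)       # NumPy function
with_function = summa.apply(fives)           # plain Python function, per element
with_lambda = summa.apply(lambda x: x // 5)  # lambda, per element

print(vectorized.tolist())
```

The `apply` variants call a Python function once per element, which is why they tend to be slower on large Series.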

None of the previous methods changes the original DF; each returns a new Series. To keep the result of a calculation, we can store it in a new column. First, let's overwrite the 'Teenustasu' column to give it some nonzero numbers.

In [50]:
df_kontovv.Teenustasu = df_kontovv.Summa * 0.1

df_kontovv.head(3)

Kuupäev,Summa,Selgitus,Teenustasu
2018-01-02,32.7,Retro kohvik,3.27
2018-01-02,7.2,Olerex Sopruse pst,0.72
2018-01-02,7.0,Autosõit,0.7


Let's create a new column.

In [51]:
df_kontovv['jagatud_viiega_summa'] = df_kontovv.Summa.floordiv(5)

df_kontovv.head(3)

Kuupäev,Summa,Selgitus,Teenustasu,jagatud_viiega_summa
2018-01-02,32.7,Retro kohvik,3.27,6.0
2018-01-02,7.2,Olerex Sopruse pst,0.72,1.0
2018-01-02,7.0,Autosõit,0.7,1.0
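
A short sketch of that behaviour on a made-up `toy` frame: the calculation alone leaves the DF untouched, and only the assignment adds the column (the name `summa_in_fives` is illustrative):

```python
import pandas as pd

# Hypothetical amounts
toy = pd.DataFrame({'Summa': [32.7, 7.2, 7.0]})

result = toy['Summa'].floordiv(5)    # a new Series; toy itself is unchanged
columns_before = list(toy.columns)

toy['summa_in_fives'] = result       # storing the result adds a column
columns_after = list(toy.columns)

print(columns_before, columns_after)
```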


String operations are available through the .str accessor, for example on the column labels.

In [58]:
df_kontovv.columns = df_kontovv.columns.str.lower()

In [59]:
df_kontovv.head()

Kuupäev,summa,selgitus,teenustasu,jagatud_viiega_summa
2018-01-02,32.7,Retro kohvik,3.27,6.0
2018-01-02,7.2,Olerex Sopruse pst,0.72,1.0
2018-01-02,7.0,Autosõit,0.7,1.0
2018-01-02,6.0,Pizza,0.6,1.0
2018-01-02,5.68,Apollo Kino Mustamae,0.568,1.0
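
Several `.str` methods can be chained to tidy labels in one step. A minimal sketch, assuming a made-up `toy` frame with untidy column labels:

```python
import pandas as pd

# Hypothetical frame whose labels have stray spaces and mixed case
toy = pd.DataFrame({' Country Name ': ['Aruba'],
                    'Country Code': ['ABW'],
                    '2010': [101669.0]})

toy.columns = (toy.columns
               .str.strip()                         # remove surrounding spaces
               .str.lower()                         # lower-case every label
               .str.replace(' ', '_', regex=False)) # spaces to underscores

print(list(toy.columns))
```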


Creating new columns from existing ones

In [60]:
df_kontovv['kogu_summa'] = df_kontovv.summa + df_kontovv.teenustasu

df_kontovv.head(3)

Kuupäev,summa,selgitus,teenustasu,jagatud_viiega_summa,kogu_summa
2018-01-02,32.7,Retro kohvik,3.27,6.0,35.97
2018-01-02,7.2,Olerex Sopruse pst,0.72,1.0,7.92
2018-01-02,7.0,Autosõit,0.7,1.0,7.7
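
An alternative to bracket assignment is `DataFrame.assign`, which returns a new DF and leaves the original as it was. A minimal sketch on a made-up `toy` frame (values copied from the output above):

```python
import pandas as pd

# Hypothetical amounts and fees
toy = pd.DataFrame({'summa': [32.7, 7.2],
                    'teenustasu': [3.27, 0.72]})

# assign() builds a copy with the extra column; toy keeps its two columns
with_total = toy.assign(kogu_summa=toy['summa'] + toy['teenustasu'])

print(list(toy.columns))
print(with_total.round(2))
```

Note that a new column has to be created with brackets or `assign`; attribute syntax such as `toy.kogu_summa = ...` only works for columns that already exist.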
