### Week 2 continued

In [1]:
import pandas as pd
import numpy as np

You can build a DataFrame in several ways, including from a list of Series or a list of dictionaries.

In [9]:
s1 = pd.Series({'name': 'Elvis',
                'item purchased': 'Rhinestones',
                'cost': 12.65})
s2 = pd.Series({'name': 'Priscilla',
                'item purchased': 'Hairspray',
                'cost': 1.65})
s3 = pd.Series({'name': 'Michael',
                'item purchased': 'Nose Job',
                'cost': 23456})


Note that when you build the DataFrame, both the row and column indices can be non-unique.

In [10]:
purchase = pd.DataFrame([s1, s2, s3], index=['Store 1', 'Store 1', 'Store 2'])

In [11]:
purchase

Unnamed: 0,cost,item purchased,name
Store 1,12.65,Rhinestones,Elvis
Store 1,1.65,Hairspray,Priscilla
Store 2,23456.0,Nose Job,Michael
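The dictionary route mentioned above is not shown in these notes, so here is a minimal sketch of it. It assumes the same three purchases; the variable names `rows` and `from_dicts` are made up for illustration.

```python
import pandas as pd

# Each dictionary becomes one row; the keys become the column labels.
rows = [
    {'name': 'Elvis', 'item purchased': 'Rhinestones', 'cost': 12.65},
    {'name': 'Priscilla', 'item purchased': 'Hairspray', 'cost': 1.65},
    {'name': 'Michael', 'item purchased': 'Nose Job', 'cost': 23456},
]
from_dicts = pd.DataFrame(rows, index=['Store 1', 'Store 1', 'Store 2'])

# Repeated row labels are accepted, so the index is not unique.
print(from_dicts.index.is_unique)
print(from_dicts.shape)
```

Column order in the result may differ from the output above, depending on the pandas version.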


When you query a DataFrame with a single argument to iloc or loc, you get a Series if only one row matches.

Querying for a repeated index label returns a DataFrame containing every row with that label.

In [5]:
purchase.loc['Store 1']

Unnamed: 0,cost,item purchased,name
Store 1,12.65,Rhinestones,Elvis
Store 1,1.65,Hairspray,Priscilla


In [6]:
purchase.loc['Store 2']

cost                 23456
item purchased    Nose Job
name               Michael
Name: Store 2, dtype: object
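A quick way to see the difference is to check the type of each result. The sketch below uses a cut-down, hypothetical version of the purchase frame (two columns only); passing a list of labels is one way to get a DataFrame back even when a single row matches.

```python
import pandas as pd

# Cut-down stand-in for the purchase frame, with a repeated 'Store 1' label.
df = pd.DataFrame(
    {'name': ['Elvis', 'Priscilla', 'Michael'],
     'cost': [12.65, 1.65, 23456.0]},
    index=['Store 1', 'Store 1', 'Store 2'],
)

repeated = df.loc['Store 1']     # label matches two rows
single = df.loc['Store 2']       # label matches one row
as_frame = df.loc[['Store 2']]   # a list of labels keeps the 2-D shape

print(type(repeated).__name__)
print(type(single).__name__)
print(type(as_frame).__name__)
```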

In [7]:
purchase['item purchased']

Store 1    Rhinestones
Store 1      Hairspray
Store 2       Nose Job
Name: item purchased, dtype: object

Two arguments to loc select the row and the column.

In [8]:
purchase.loc['Store 1', 'cost']

Store 1    12.65
Store 1     1.65
Name: cost, dtype: object
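The shape of the result again depends on how many rows match. As a small sketch on a cut-down, hypothetical copy of the frame: a label that matches one row gives a single value, and the same cell can be reached by position with iloc.

```python
import pandas as pd

# Cut-down stand-in for the purchase frame.
df = pd.DataFrame(
    {'name': ['Elvis', 'Priscilla', 'Michael'],
     'cost': [12.65, 1.65, 23456.0]},
    index=['Store 1', 'Store 1', 'Store 2'],
)

# One matching row and one column: a single value.
single_cost = df.loc['Store 2', 'cost']

# Two matching rows and one column: a Series with two entries.
repeated_costs = df.loc['Store 1', 'cost']

# iloc takes integer positions, so look up the column position first.
col_pos = df.columns.get_loc('cost')
same_value = df.iloc[2, col_pos]

print(single_cost, len(repeated_costs), same_value)
```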

What if we want a whole column? There are two options:

    1) transpose and use loc (but this is ugly)
    
    2) index directly with the column name, since every column has a name

In [9]:
purchase.T.loc['cost']

Store 1    12.65
Store 1     1.65
Store 2    23456
Name: cost, dtype: object

In [10]:
purchase['cost']

Store 1    12.65
Store 1     1.65
Store 2    23456
Name: cost, dtype: object

You can chain loc/iloc with column indexing.

In [11]:
purchase.loc['Store 1']['cost']

Store 1    12.65
Store 1     1.65
Name: cost, dtype: object

However, chaining comes at a cost: it tends to produce copies of the DataFrame.

Particularly when changing values, it's better to pass multiple arguments to a single .loc call.
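To make the risk concrete, here is a sketch that assigns through a chain and then through a single .loc call. It is an illustration, not the example from the course: it uses a boolean-mask selection rather than the `.loc['Store 1']['cost']` chain, because a mask selection reliably returns a new object, and it runs on a cut-down, hypothetical copy of the frame.

```python
import warnings

import pandas as pd

# Cut-down stand-in for the purchase frame.
df = pd.DataFrame(
    {'name': ['Elvis', 'Priscilla', 'Michael'],
     'cost': [12.65, 1.65, 23456.0]},
    index=['Store 1', 'Store 1', 'Store 2'],
)

# Chained form: df[mask] builds a temporary object, and the assignment lands
# on that temporary object. pandas normally warns about this; the warning is
# silenced here only to keep the output short.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    df[df['cost'] > 5]['cost'] = 0.0
after_chain = df['cost'].tolist()

# Single .loc call: row selector and column go in together, so the
# assignment reaches the original frame.
df.loc[df['cost'] > 5, 'cost'] = 0.0
after_loc = df['cost'].tolist()

print(after_chain)
print(after_loc)
```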


The example below passes : to slice all rows, and a list of columns as the second argument.

In [12]:
purchase.loc[:,['cost', 'name']]

Unnamed: 0,cost,name
Store 1,12.65,Elvis
Store 1,1.65,Priscilla
Store 2,23456.0,Michael


In [13]:
purchase.loc['Store 2', 'name'] = 'Lisa Marie'

In [14]:
purchase

Unnamed: 0,cost,item purchased,name
Store 1,12.65,Rhinestones,Elvis
Store 1,1.65,Hairspray,Priscilla
Store 2,23456.0,Nose Job,Lisa Marie


In the pandas world, friends don't let friends chain calls!

Remember, a pandas DataFrame is just a two-axis labelled array.

### Dropping data

Note that drop returns a new DataFrame with the data dropped; it doesn't change the original.

In [15]:
purchase.drop('Store 1')

Unnamed: 0,cost,item purchased,name
Store 2,23456,Nose Job,Lisa Marie


In [16]:
purchase

Unnamed: 0,cost,item purchased,name
Store 1,12.65,Rhinestones,Elvis
Store 1,1.65,Hairspray,Priscilla
Store 2,23456.0,Nose Job,Lisa Marie
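Since drop leaves the original alone, the usual pattern is to bind the result to a name. A minimal sketch on a cut-down, hypothetical copy of the frame, also showing the `columns=` keyword for dropping a column rather than rows:

```python
import pandas as pd

# Cut-down stand-in for the purchase frame.
df = pd.DataFrame(
    {'name': ['Elvis', 'Priscilla', 'Lisa Marie'],
     'cost': [12.65, 1.65, 23456.0]},
    index=['Store 1', 'Store 1', 'Store 2'],
)

# Dropping a row label removes every row carrying that label.
without_store1 = df.drop('Store 1')

# The columns keyword drops by column label instead.
without_name = df.drop(columns='name')

# The original still has all rows and columns.
print(df.shape, without_store1.shape, without_name.shape)
```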


You can delete columns using del. Note that this works directly on the DataFrame it is applied to, which is why the example below deletes from a copy.

In [18]:
copy_df = purchase.copy()
del copy_df['name']
copy_df

Unnamed: 0,cost,item purchased
Store 1,12.65,Rhinestones
Store 1,1.65,Hairspray
Store 2,23456.0,Nose Job


### Adding a column

In [19]:
purchase['location'] = None

In [20]:
purchase

Unnamed: 0,cost,item purchased,name,location
Store 1,12.65,Rhinestones,Elvis,
Store 1,1.65,Hairspray,Priscilla,
Store 2,23456.0,Nose Job,Lisa Marie,


In [21]:
purchase['poulet'] = [1,2,3]

In [22]:
purchase

Unnamed: 0,cost,item purchased,name,location,poulet
Store 1,12.65,Rhinestones,Elvis,,1
Store 1,1.65,Hairspray,Priscilla,,2
Store 2,23456.0,Nose Job,Lisa Marie,,3
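Two details are worth checking when adding columns this way: a scalar is broadcast to every row, while a list has to supply one value per row. The sketch below runs on a cut-down, hypothetical copy of the frame; the `tax` column is invented for illustration and is not part of the course data.

```python
import pandas as pd

# Cut-down stand-in for the purchase frame.
df = pd.DataFrame(
    {'name': ['Elvis', 'Priscilla', 'Lisa Marie'],
     'cost': [12.65, 1.65, 23456.0]},
    index=['Store 1', 'Store 1', 'Store 2'],
)

# A scalar (here None) is repeated for every row.
df['location'] = None

# A list with the wrong number of values is rejected with a ValueError.
length_error = None
try:
    df['poulet'] = [1, 2]
except ValueError as err:
    length_error = str(err)

# assign() returns a new frame with the extra column and leaves df as it was.
with_tax = df.assign(tax=df['cost'] * 0.1)

print(length_error)
print(list(df.columns), list(with_tax.columns))
```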


### Accessing Dataframes

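Remember that when you index a DataFrame you may be accessing a view, and if you change the view you change the underlying data, as the cells below show. Whether this happens depends on the pandas version: with Copy-on-Write enabled (the default from pandas 3.0), the original is left unchanged. If you only want to change the data in a new object, use the copy() method.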

In [12]:
costs = purchase['cost']
costs

Store 1       12.65
Store 1        1.65
Store 2    23456.00
Name: cost, dtype: float64

In [13]:
costs += 2

In [14]:
costs

Store 1       14.65
Store 1        3.65
Store 2    23458.00
Name: cost, dtype: float64

In [15]:
purchase

Unnamed: 0,cost,item purchased,name
Store 1,14.65,Rhinestones,Elvis
Store 1,3.65,Hairspray,Priscilla
Store 2,23458.0,Nose Job,Michael
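For comparison, here is a sketch of the same steps with an explicit copy() call, on a cut-down, hypothetical version of the frame. The copy owns its own data, so the in-place addition does not reach the original.

```python
import pandas as pd

# Cut-down stand-in for the purchase frame.
df = pd.DataFrame(
    {'name': ['Elvis', 'Priscilla', 'Michael'],
     'cost': [12.65, 1.65, 23456.0]},
    index=['Store 1', 'Store 1', 'Store 2'],
)

# copy() gives the Series its own data before it is modified.
costs = df['cost'].copy()
costs += 2

print(costs.tolist())
print(df['cost'].tolist())
```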


In [16]:
!cat olympics.csv

'cat' is not recognized as an internal or external command,
operable program or batch file.

Note: cat is a Unix command, so it fails in the Windows command prompt; the Windows equivalent is `!type olympics.csv`.


In [17]:
df = pd.read_csv('olympics.csv')

In [18]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,,№ Summer,01 !,02 !,03 !,Total,№ Winter,01 !,02 !,03 !,Total,№ Games,01 !,02 !,03 !,Combined total
1,Afghanistan (AFG),13,0,0,2,2,0,0,0,0,0,13,0,0,2,2
2,Algeria (ALG),12,5,2,8,15,3,0,0,0,0,15,5,2,8,15
3,Argentina (ARG),23,18,24,28,70,18,0,0,0,0,41,18,24,28,70
4,Armenia (ARM),5,1,2,9,12,6,0,0,0,0,11,1,2,9,12


In [22]:
df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)

In [23]:
df.head()

Unnamed: 0,№ Summer,01 !,02 !,03 !,Total,№ Winter,01 !.1,02 !.1,03 !.1,Total.1,№ Games,01 !.2,02 !.2,03 !.2,Combined total
Afghanistan (AFG),13,0,0,2,2,0,0,0,0,0,13,0,0,2,2
Algeria (ALG),12,5,2,8,15,3,0,0,0,0,15,5,2,8,15
Argentina (ARG),23,18,24,28,70,18,0,0,0,0,41,18,24,28,70
Armenia (ARM),5,1,2,9,12,6,0,0,0,0,11,1,2,9,12
Australasia (ANZ) [ANZ],2,3,4,5,12,0,0,0,0,0,2,3,4,5,12
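The effect of the two extra arguments can be reproduced on a tiny in-memory file. The CSV text below is a made-up miniature of the layout (a row of column numbers, then the real header with a repeated name, then data), not the contents of olympics.csv.

```python
import io

import pandas as pd

# Hypothetical miniature of the file layout.
raw = (
    "0,1,2,3\n"
    ",Total,Total,Combined total\n"
    "Afghanistan (AFG),2,0,2\n"
    "Algeria (ALG),15,0,15\n"
)

# Default read: the row of numbers becomes the header and the real header
# ends up as the first data row.
default_read = pd.read_csv(io.StringIO(raw))

# skiprows=1 skips the row of numbers; index_col=0 uses the first column
# (the country names) as the row index.
cleaned = pd.read_csv(io.StringIO(raw), index_col=0, skiprows=1)

# Repeated header names are made unique with a numeric suffix.
print(list(default_read.columns))
print(list(cleaned.columns))
```

This is where names such as `01 !.1` and `Total.1` in the output above come from: pandas appends `.1`, `.2`, and so on to repeated column names.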


In [24]:
df.columns

Index(['№ Summer', '01 !', '02 !', '03 !', 'Total', '№ Winter', '01 !.1',
       '02 !.1', '03 !.1', 'Total.1', '№ Games', '01 !.2', '02 !.2', '03 !.2',
       'Combined total'],
      dtype='object')

In [None]:
for col in df.columns:
    # '01 !' is four characters long, so col[4:] keeps only a '.1' or '.2' suffix.
    if col[:2] == '01':
        df.rename(columns={col:'Gold'+col[4:]}, inplace=True)
    if col[:2] == '02':
        df.rename(columns={col:'Silver'+col[4:]}, inplace=True)