# Pandas Review

Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

You can find it here: http://pandas.pydata.org/

And the documentation can be found here: http://pandas.pydata.org/pandas-docs/stable/

In this notebook we review some of its functionality.

In [1]:
import pandas as pd
import numpy as np

## Create random data

Create a dataframe with four features (columns): "A", "B", "C", "D". The corresponding values can be (homogeneous or heterogeneous) arrays, scalars, strings, missing values (NaN), etc.  
Rows (records) are indexed automatically, starting from 0.

In [3]:
df=pd.DataFrame({'A':np.array([1,7,2,-2],dtype='int32'),
                'B':1,
                'C':['pippo']*4,
                'D':np.array([0.5,'pluto',np.nan,np.nan])})
df

| | A | B | C | D |
|---|---|---|---|---|
| 0 | 1 | 1 | pippo | 0.5 |
| 1 | 7 | 1 | pippo | pluto |
| 2 | 2 | 1 | pippo | nan |
| 3 | -2 | 1 | pippo | nan |


Now inspect the dataframe a little bit

In [4]:
df.head(3)

| | A | B | C | D |
|---|---|---|---|---|
| 0 | 1 | 1 | pippo | 0.5 |
| 1 | 7 | 1 | pippo | pluto |
| 2 | 2 | 1 | pippo | nan |


In [5]:
df.index

RangeIndex(start=0, stop=4, step=1)

In [6]:
df.columns

Index([u'A', u'B', u'C', u'D'], dtype='object')

In [7]:
df.values

array([[1, 1, 'pippo', '0.5'],
       [7, 1, 'pippo', 'pluto'],
       [2, 1, 'pippo', 'nan'],
       [-2, 1, 'pippo', 'nan']], dtype=object)
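The `'nan'` strings in the output above come from NumPy rather than pandas: an array holds a single dtype, so mixing a float, a string, and `np.nan` converts every element to text. The sketch below illustrates this on a standalone array and contrasts it with a plain Python list; the variable names are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# A NumPy array has one dtype, so mixed floats and strings all become strings.
mixed_array = np.array([0.5, 'pluto', np.nan, np.nan])
print(mixed_array.dtype.kind)  # 'U' means unicode text

# The text 'nan' is an ordinary string, not a missing value.
array_missing = pd.isna(mixed_array).tolist()

# A plain list keeps each element's own type, so NaN stays a real missing value.
from_list = pd.Series([0.5, 'pluto', np.nan, np.nan])
list_missing = from_list.isna().tolist()

print(array_missing)
print(list_missing)
```

If real missing values are wanted in column "D", passing a list instead of `np.array(...)` is one way to get them.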

Try some operations

In [10]:
df.T

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| A | 1 | 7 | 2 | -2 |
| B | 1 | 1 | 1 | 1 |
| C | pippo | pippo | pippo | pippo |
| D | 0.5 | pluto | nan | nan |


In [11]:
df.sort_index(axis=0,ascending=False)

| | A | B | C | D |
|---|---|---|---|---|
| 3 | -2 | 1 | pippo | nan |
| 2 | 2 | 1 | pippo | nan |
| 1 | 7 | 1 | pippo | pluto |
| 0 | 1 | 1 | pippo | 0.5 |


In [12]:
df.sort_values(by='A')

| | A | B | C | D |
|---|---|---|---|---|
| 3 | -2 | 1 | pippo | nan |
| 0 | 1 | 1 | pippo | 0.5 |
| 2 | 2 | 1 | pippo | nan |
| 1 | 7 | 1 | pippo | pluto |


## Create and modify a dataframe
Now create a dataframe with dates (a time series) as the index and a corresponding 7x2 matrix of random numbers (7 records with 2 features). The 2 features are "Temperature" and "Humidity".

In [22]:
dates=pd.date_range('20170101', periods=7, freq='D')
dates

DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
               '2017-01-05', '2017-01-06', '2017-01-07'],
              dtype='datetime64[ns]', freq='D')

In [21]:
a=pd.DatetimeIndex(['2017-01-01','2017-05-01'])
a

DatetimeIndex(['2017-01-01', '2017-05-01'], dtype='datetime64[ns]', freq=None)

In [23]:
df = pd.DataFrame(np.random.rand(7,2), index=dates, columns=['Temperature','Humidity'])
df

| | Temperature | Humidity |
|---|---|---|
| 2017-01-01 | 0.051779 | 0.503905 |
| 2017-01-02 | 0.96895 | 0.509555 |
| 2017-01-03 | 0.406067 | 0.869644 |
| 2017-01-04 | 0.938972 | 0.515843 |
| 2017-01-05 | 0.383742 | 0.960885 |
| 2017-01-06 | 0.27703 | 0.389914 |
| 2017-01-07 | 0.259599 | 0.444251 |


Add a column with power consumption in kWh 

In [24]:
df['Power Consumption']=np.random.rand(7,1)+7.4
df

| | Temperature | Humidity | Power Consumption |
|---|---|---|---|
| 2017-01-01 | 0.051779 | 0.503905 | 7.410935 |
| 2017-01-02 | 0.96895 | 0.509555 | 8.281145 |
| 2017-01-03 | 0.406067 | 0.869644 | 8.081862 |
| 2017-01-04 | 0.938972 | 0.515843 | 7.951819 |
| 2017-01-05 | 0.383742 | 0.960885 | 7.982629 |
| 2017-01-06 | 0.27703 | 0.389914 | 7.673076 |
| 2017-01-07 | 0.259599 | 0.444251 | 7.690727 |


Modify the temperature to have a reasonable value for the season

In [25]:
df['Temperature']=np.random.randint(2,high=12, size=(7))
df

| | Temperature | Humidity | Power Consumption |
|---|---|---|---|
| 2017-01-01 | 7 | 0.503905 | 7.410935 |
| 2017-01-02 | 5 | 0.509555 | 8.281145 |
| 2017-01-03 | 2 | 0.869644 | 8.081862 |
| 2017-01-04 | 3 | 0.515843 | 7.951819 |
| 2017-01-05 | 10 | 0.960885 | 7.982629 |
| 2017-01-06 | 11 | 0.389914 | 7.673076 |
| 2017-01-07 | 9 | 0.444251 | 7.690727 |


Add a bogus feature and then remove it

In [26]:
df['bogus']=np.nan
df

| | Temperature | Humidity | Power Consumption | bogus |
|---|---|---|---|---|
| 2017-01-01 | 7 | 0.503905 | 7.410935 | NaN |
| 2017-01-02 | 5 | 0.509555 | 8.281145 | NaN |
| 2017-01-03 | 2 | 0.869644 | 8.081862 | NaN |
| 2017-01-04 | 3 | 0.515843 | 7.951819 | NaN |
| 2017-01-05 | 10 | 0.960885 | 7.982629 | NaN |
| 2017-01-06 | 11 | 0.389914 | 7.673076 | NaN |
| 2017-01-07 | 9 | 0.444251 | 7.690727 | NaN |


In [27]:
df=df.drop(['bogus'],axis=1)
df.head()

| | Temperature | Humidity | Power Consumption |
|---|---|---|---|
| 2017-01-01 | 7 | 0.503905 | 7.410935 |
| 2017-01-02 | 5 | 0.509555 | 8.281145 |
| 2017-01-03 | 2 | 0.869644 | 8.081862 |
| 2017-01-04 | 3 | 0.515843 | 7.951819 |
| 2017-01-05 | 10 | 0.960885 | 7.982629 |
2017-01-05,10,0.960885,7.982629


## Indexing

Try to figure out what each of the following indexing methods does.

If in trouble check here: http://pandas.pydata.org/pandas-docs/stable/indexing.html

The examples below use the dataframe `df` with temperature, humidity and power consumption built in the previous section.

## Getting
Select a column (feature) by its name

In [None]:
df['Temperature']  # equivalent to df.Temperature, but attribute access fails for names with spaces such as 'Power Consumption'

In [None]:
df[['Temperature','Humidity']]

Select rows by indices (positional slice)

In [None]:
df[0:2]

Getting a cross section of the table by label

In [None]:
df.loc[dates[0]]
#df.loc[:,['Humidity','Power Consumption']]
#df.loc['20170102',['Humidity','Temperature']]

Selection by Position

In [None]:
df.iloc[3]
# df.ix[0] worked in old pandas; .ix was removed in pandas 1.0, use .iloc or .loc instead

In [None]:
df.iloc[2:4,1:3]
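One difference between label-based and position-based selection is easy to miss: a `.loc` slice includes both endpoints, while an `.iloc` slice excludes the end position, like ordinary Python slicing. The following is a minimal sketch on a small stand-in dataframe; `demo` and its values are made up for illustration and are not the random data generated above.

```python
import pandas as pd

dates = pd.date_range('20170101', periods=7, freq='D')
# Hypothetical values, chosen only so the result is easy to check by eye.
demo = pd.DataFrame({'Temperature': [7, 5, 2, 3, 10, 11, 9],
                     'Humidity': [0.5, 0.5, 0.9, 0.5, 1.0, 0.4, 0.4]},
                    index=dates)

# Label-based slice: both endpoints (Jan 2 and Jan 4) are included.
by_label = demo.loc['2017-01-02':'2017-01-04', 'Temperature']

# Position-based slice: position 3 is excluded, so only rows 1 and 2 are returned.
by_position = demo.iloc[1:3, 0]

print(by_label.tolist())
print(by_position.tolist())
```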

Boolean selection

In [None]:
df[df['Temperature'] >= 7 ]

In [None]:
df[df > 5]
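The two forms above behave differently: a boolean Series (one value per row) keeps or drops whole rows, whereas a boolean dataframe such as `df > 5` keeps the shape and replaces non-matching cells with NaN. Conditions can also be combined with `&` and `|`, with each condition wrapped in parentheses. A small sketch, using a made-up `demo` dataframe rather than the random data above:

```python
import pandas as pd

# Hypothetical values, chosen only so the result is easy to check by eye.
demo = pd.DataFrame({'Temperature': [7, 5, 2, 3, 10, 11, 9],
                     'Humidity': [0.50, 0.51, 0.87, 0.52, 0.96, 0.39, 0.44]},
                    index=pd.date_range('20170101', periods=7, freq='D'))

# Row filter with two conditions; the parentheses are required because
# & and | bind more tightly than the comparison operators.
warm_and_dry = demo[(demo['Temperature'] >= 7) & (demo['Humidity'] < 0.5)]
cold_or_humid = demo[(demo['Temperature'] < 3) | (demo['Humidity'] > 0.9)]

# Element-wise mask: same shape as demo, non-matching cells become NaN.
masked = demo[demo > 5]

print(warm_and_dry)
print(cold_or_humid)
print(masked)
```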

In [None]:
df2 = df.copy()
df2['Quality'] = ['nice day', 'nice day','soso day','bad day','bad day','soso day', 'wonderful day']
df2

In [None]:
df2[df2['Quality'].isin(['wonderful day','nice day'])]

## Setting

Setting a new column automatically aligns the data by the indexes

In [None]:
s1 = pd.Series(range(1,8), index=pd.date_range('20170102', periods=7))
print(s1)
df['Bogus'] = s1
df
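To make the alignment visible, here is a reduced sketch of the same idea with three rows. The Series starts one day later than the dataframe, so the first row has no matching label and becomes NaN, and the Series value for the extra last day is dropped. The names `demo` and `shifted` are illustrative only.

```python
import pandas as pd

demo = pd.DataFrame({'Temperature': [7, 5, 2]},
                    index=pd.date_range('20170101', periods=3))

# Index runs from Jan 2 to Jan 4, one day later than demo's index.
shifted = pd.Series([1, 2, 3], index=pd.date_range('20170102', periods=3))

# Values are matched by index label, not by position.
demo['Bogus'] = shifted
print(demo)
```

Because of the NaN, the new column is stored as floats even though the Series held integers.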

In [None]:
df2.iloc[0,1] = 0
df2.loc[:,'Temperature'] = np.array([5] * len(df))
df2

Example of fancy operation (column removal if records are not numbers)

In [None]:
for column in df2.columns:
    if not np.issubdtype(df2[column].dtype, np.number):
        df2.drop(column, axis=1, inplace=True)
df2
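As an alternative to the explicit loop, pandas provides `DataFrame.select_dtypes`, which returns a new dataframe containing only the columns of the requested types. The sketch below applies it to a small made-up dataframe; it returns a copy rather than modifying the original in place, which differs slightly from the loop above.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two numeric columns and one text column.
demo = pd.DataFrame({'Temperature': [7, 5],
                     'Humidity': [0.50, 0.51],
                     'Quality': ['nice day', 'bad day']})

# Keep only the columns whose dtype is numeric.
numeric_only = demo.select_dtypes(include=[np.number])
print(numeric_only.columns.tolist())
```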

Some other global operations

In [None]:
#-df
#df*5
df2[df2['Power Consumption']>8]=-df2
df2

There's much more that Pandas can do for you. Make sure to check the documentation: http://pandas.pydata.org/pandas-docs/stable/

## Exercises:

Now try the exercises with the following dataset:

In [None]:
df = pd.read_csv("../data/titanic-train.csv")

- select passengers that survived
- select passengers that embarked in port S
- select male passengers
- select passengers who paid less than 40 and were in third class
- locate the name of the passenger with Id 674
- calculate the average age of passengers using the function mean()
- count the number of survived and the number of dead passengers
- count the number of males and females
- count the number of survived and dead per each gender
- calculate the average price paid by passengers who survived and by those who died
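
The counting and averaging exercises rely on a few methods not shown earlier in this notebook: `mean()`, `value_counts()` and `groupby()`. As a hint rather than a solution, here is a minimal sketch on a made-up weather dataframe; the column names and values are hypothetical and unrelated to the Titanic file.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Titanic dataset.
toy = pd.DataFrame({'Quality': ['nice day', 'bad day', 'nice day', 'bad day'],
                    'Windy': [True, False, False, False],
                    'Power Consumption': [7.0, 8.0, 7.5, 9.0]})

# Average of one column.
mean_power = toy['Power Consumption'].mean()

# Number of rows for each distinct value of a column.
counts = toy['Quality'].value_counts()

# Number of rows for each combination of two columns.
per_group = toy.groupby(['Quality', 'Windy']).size()

# Average of one column within each group of another.
avg_by_quality = toy.groupby('Quality')['Power Consumption'].mean()

print(mean_power)
print(counts)
print(per_group)
print(avg_by_quality)
```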
