# Pandas 

In this notebook we take a closer look at the pandas package: creating a DataFrame, selecting subsets of it, and performing simple operations on it. The last part covers how to import data and manipulate DataFrames.

## Introduction

In [1]:
import numpy as np
import pandas as pd

Define a DataFrame with pandas:

In [12]:
icecream_sales = np.array([30, 40, 35, 130, 120, 60])
weather_coded = np.array([0, 1, 0, 1, 1, 0])
customers = np.array([2000, 2100, 1500, 8000, 7200, 2000])
df = pd.DataFrame({'icecream_sales': icecream_sales,
                   'weather_coded': weather_coded,
                   'customers': customers})

print(df)

   icecream_sales  weather_coded  customers
0              30              0       2000
1              40              1       2100
2              35              0       1500
3             130              1       8000
4             120              1       7200
5              60              0       2000


Define and assign an index (six month-end dates starting in April 2010):

In [14]:
ourIndex = pd.date_range(start='04/2010', freq='M', periods=6)
df.set_index(ourIndex, inplace=True)

Print the DataFrame

In [15]:
print(f'df: \n{df}\n')

df: 
            icecream_sales  weather_coded  customers
2010-04-30              30              0       2000
2010-05-31              40              1       2100
2010-06-30              35              0       1500
2010-07-31             130              1       8000
2010-08-31             120              1       7200
2010-09-30              60              0       2000
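As a hedged side note on the index above: to my knowledge, recent pandas releases deprecate the `'M'` frequency alias in favour of `'ME'`, so the cell may emit a warning depending on the installed version. The sketch below builds the same kind of month-end index with the offset object `pd.offsets.MonthEnd()`, which should behave the same across versions. The frame `demo` and its `sales` column are hypothetical and serve only as an illustration.

```python
import pandas as pd

# Hypothetical three-row frame, used only for illustration.
demo = pd.DataFrame({'sales': [30, 40, 35]})

# A start date that is not a month end is rolled forward to the next one.
month_ends = pd.date_range(start='2010-04-01', periods=3,
                           freq=pd.offsets.MonthEnd())

# set_index returns a new DataFrame unless inplace=True is passed.
demo = demo.set_index(month_ends)
print(demo)
```

Reassigning the result of `set_index` is an alternative to `inplace=True` and keeps each step explicit.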



In [16]:
# Select columns by passing a list of column names (hence the double brackets '[[...]]')
subset1 = df[['icecream_sales', 'customers']]
print(f'subset1: \n{subset1}\n')

subset1: 
            icecream_sales  customers
2010-04-30              30       2000
2010-05-31              40       2100
2010-06-30              35       1500
2010-07-31             130       8000
2010-08-31             120       7200
2010-09-30              60       2000
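The difference between single and double brackets in column selection can be checked directly. This is a minimal sketch on a hypothetical frame `demo`: a single column name returns a Series, while a list of names returns a DataFrame, even when the list holds only one name.

```python
import pandas as pd

# Hypothetical frame, used only for illustration.
demo = pd.DataFrame({'icecream_sales': [30, 40, 35],
                     'customers': [2000, 2100, 1500]})

as_series = demo['customers']    # single brackets with one name: Series
as_frame = demo[['customers']]   # list of names: DataFrame with one column

print(type(as_series).__name__)
print(type(as_frame).__name__)
```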



In [17]:
# Select rows with a slice inside single brackets '[...]'
subset2 = df[1:4]  # same as df['2010-05-31':'2010-07-31']
print(f'subset2: \n{subset2}\n')

subset2: 
            icecream_sales  weather_coded  customers
2010-05-31              40              1       2100
2010-06-30              35              0       1500
2010-07-31             130              1       8000
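The two slices mentioned in the comment follow different end-point rules, which is easy to overlook. A small sketch on a hypothetical frame `demo` (the label slice is written with `.loc` here to make the intent explicit): the integer slice excludes its end position, while the label slice includes both end labels.

```python
import pandas as pd

# Hypothetical frame with a month-end index, used only for illustration.
idx = pd.date_range(start='2010-04-30', periods=6,
                    freq=pd.offsets.MonthEnd())
demo = pd.DataFrame({'sales': [30, 40, 35, 130, 120, 60]}, index=idx)

by_position = demo[1:4]                            # positions 1, 2, 3
by_label = demo.loc['2010-05-31':'2010-07-31']     # both end labels included

print(by_position.equals(by_label))
```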



Access rows and columns by index and variable names:

In [18]:
# '.loc' selects data by row and column labels
# '.iloc' selects data by integer row and column positions
subset3 = df.loc['2010-05-31', 'customers']  # same as df.iloc[1, 2]
print(f'subset3: \n{subset3}\n')

subset3: 
2100
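To see that `.loc` and `.iloc` address the same cell, the sketch below looks up one value both ways on a hypothetical two-column frame `demo`. `Index.get_loc` is used to translate a column label into its integer position, which avoids hard-coding the position.

```python
import pandas as pd

# Hypothetical frame with a month-end index, used only for illustration.
idx = pd.date_range(start='2010-04-30', periods=3,
                    freq=pd.offsets.MonthEnd())
demo = pd.DataFrame({'icecream_sales': [30, 40, 35],
                     'customers': [2000, 2100, 1500]}, index=idx)

by_label = demo.loc['2010-05-31', 'customers']   # row and column labels
col_pos = demo.columns.get_loc('customers')      # label to integer position
by_position = demo.iloc[1, col_pos]              # row and column positions

print(by_label, by_position)
```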



Access rows and columns by integer positions:

In [22]:
subset4 = df.iloc[1:4, 0:2]
print(f'subset4: \n{subset4}\n')

subset4: 
            icecream_sales  weather_coded
2010-05-31              40              1
2010-06-30              35              0
2010-07-31             130              1



## DataFrame manipulation

We now look at some of the operations that can be performed on the DataFrame constructed above.

Add a column with the sales from two months earlier:

In [25]:
df['icecream_sales_lag2'] = df['icecream_sales'].shift(2)
print(f'df: \n{df}\n')

df: 
            icecream_sales  weather_coded  customers  icecream_sales_lag2
2010-04-30              30              0       2000                  NaN
2010-05-31              40              1       2100                  NaN
2010-06-30              35              0       1500                 30.0
2010-07-31             130              1       8000                 40.0
2010-08-31             120              1       7200                 35.0
2010-09-30              60              0       2000                130.0
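The lagged column above starts with two `NaN` values because no observations exist before the first row, and the column becomes floating point to hold them. A minimal sketch on a standalone Series (the name `change` is hypothetical) shows the lag and one typical use, the change relative to two periods earlier:

```python
import pandas as pd

sales = pd.Series([30, 40, 35, 130, 120, 60], name='icecream_sales')

lag2 = sales.shift(2)     # value from two rows earlier; first two are NaN
change = sales - lag2     # difference to the value two rows earlier

print(lag2.tolist())
print(change.tolist())
```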



Use a pandas.Categorical object to attach labels (0 = bad; 1 = good):

In [26]:
df['weather'] = pd.Categorical.from_codes(codes=df['weather_coded'],
                                          categories=['bad', 'good'])
print(f'df: \n{df}\n')

df: 
            icecream_sales  weather_coded  customers  icecream_sales_lag2  \
2010-04-30              30              0       2000                  NaN   
2010-05-31              40              1       2100                  NaN   
2010-06-30              35              0       1500                 30.0   
2010-07-31             130              1       8000                 40.0   
2010-08-31             120              1       7200                 35.0   
2010-09-30              60              0       2000                130.0   

           weather  
2010-04-30     bad  
2010-05-31    good  
2010-06-30     bad  
2010-07-31    good  
2010-08-31    good  
2010-09-30     bad  
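`Categorical.from_codes` maps each integer code to the category at that position in `categories`, so code 0 becomes `'bad'` and code 1 becomes `'good'`. A short sketch on a plain list of codes, showing that the labels and the original codes can both be recovered:

```python
import pandas as pd

codes = [0, 1, 0, 1, 1, 0]
weather = pd.Categorical.from_codes(codes=codes,
                                    categories=['bad', 'good'])

print(list(weather))              # labels in row order
print(weather.codes.tolist())     # the integer codes behind the labels
print(list(weather.categories))   # the label for each code
```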



Calculate the group means for each weather category:

In [27]:
group_means = df.groupby('weather').mean()
print(f'group_means: \n{group_means}\n')

group_means: 
         icecream_sales  weather_coded    customers  icecream_sales_lag2
weather                                                                 
bad           41.666667            0.0  1833.333333                 80.0
good          96.666667            1.0  5766.666667                 37.5



Calculate the group medians for each weather category:

In [28]:
group_medians = df.groupby('weather').median()
print(f'group_medians: \n{group_medians}\n')

group_medians: 
         icecream_sales  weather_coded  customers  icecream_sales_lag2
weather                                                               
bad                35.0            0.0     2000.0                 80.0
good              120.0            1.0     7200.0                 37.5
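The two cells above aggregate every numeric column. When only sales are of interest, one option is to select the column after grouping and request several statistics at once with `agg`. This is a sketch on a hypothetical frame `demo` that stores the weather labels as plain strings:

```python
import pandas as pd

# Hypothetical frame, used only for illustration.
demo = pd.DataFrame({
    'icecream_sales': [30, 40, 35, 130, 120, 60],
    'weather': ['bad', 'good', 'bad', 'good', 'good', 'bad'],
})

# One row per weather category, one column per statistic.
summary = demo.groupby('weather')['icecream_sales'].agg(['mean', 'median'])
print(summary)
```

Selecting the column first also avoids computing statistics for columns such as `weather_coded`, where they carry no information.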

