# Pandas


**Pandas** (the name derives from "panel data") is built on NumPy and is a core package for data scientists.

**Data munging** - taking data in one format and making it conform to another format

#### Two main Pandas data objects:
* Series
* DataFrames

### Series

The series is essentially a labelled NumPy vector



In [1]:
import pandas as pd

In [97]:
prices = pd.Series([1,1,2,3,5],
            index =['apple', 'pear', 'banana', 'mango', 'jackfruit'])


In [98]:
print(prices)

apple        1
pear         1
banana       2
mango        3
jackfruit    5
dtype: int64


Inspecting the `.index` attribute of the series, we see the following:

In [6]:
prices.index

Index(['apple', 'pear', 'banana', 'mango', 'jackfruit'], dtype='object')

This index allows slicing based on index label, in addition to being able to slice by the numeric index value, as in a list or array.

Single values can be accessed by simply passing the index label within square brackets:

In [8]:
prices['mango']

3

Use the `.loc` indexer to slice by index label:

In [9]:
prices.loc['banana':]

banana       2
mango        3
jackfruit    5
dtype: int64
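One detail worth checking when slicing by label: a `.loc` slice includes both endpoints, whereas a positional `.iloc` slice excludes the stop position, as with a list. The short sketch below rebuilds the `prices` Series so that it runs on its own; the variable names `by_label` and `by_position` are illustrative only.

```python
import pandas as pd

prices = pd.Series([1, 1, 2, 3, 5],
                   index=['apple', 'pear', 'banana', 'mango', 'jackfruit'])

# Label-based slice: both 'pear' and 'mango' are included
by_label = prices.loc['pear':'mango']

# Position-based slice: position 3 (mango) is excluded
by_position = prices.iloc[1:3]

print(list(by_label.index))
print(list(by_position.index))
```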

You can also use the `.loc` indexer to subset by a list of labels.

In [12]:
my_fruits = ['apple', 'banana', 'mango']

In [13]:
prices.loc[my_fruits]

apple     1
banana    2
mango     3
dtype: int64

Use the `.iloc` indexer to slice by numeric position:

In [10]:
prices.iloc[0:3]

apple     1
pear      1
banana    2
dtype: int64

**Warning**: Since Pandas allows Series to have numeric labels, there is the potential to confuse numeric indexes and numeric labels. Always be sure to explicitly specify `.loc` or `.iloc` to avoid confusion.
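To make the warning concrete, here is a small sketch with a hypothetical Series named `codes` whose integer labels deliberately do not match the positions. The same number then selects different elements depending on whether it is read as a label or as a position.

```python
import pandas as pd

# Hypothetical Series: integer labels that differ from the positions
codes = pd.Series(['a', 'b', 'c'], index=[2, 0, 1])

by_label = codes.loc[0]      # the element whose label is 0
by_position = codes.iloc[0]  # the element in the first position

print(by_label, by_position)
```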

### Series Operations

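Arithmetic between Series is aligned on index labels rather than on position. A label that is missing from either Series yields `NaN` in the result, which is why the output below is of type float.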

In [16]:
inventory = pd.Series([10, 50, 41, 22],
            index=['pear', 'banana', 'mango', 'apple'])

In [17]:
prices * inventory

apple         22.0
banana       100.0
jackfruit      NaN
mango        123.0
pear          10.0
dtype: float64
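If a `NaN` for unmatched labels is not what you want, one possible approach is the `Series.mul` method with its `fill_value` argument. The sketch below assumes that a fruit with no inventory entry should count as zero stock; whether that assumption is appropriate depends on the data.

```python
import pandas as pd

prices = pd.Series([1, 1, 2, 3, 5],
                   index=['apple', 'pear', 'banana', 'mango', 'jackfruit'])
inventory = pd.Series([10, 50, 41, 22],
                      index=['pear', 'banana', 'mango', 'apple'])

# Plain multiplication: jackfruit has no inventory entry, so it becomes NaN
plain = prices * inventory

# With fill_value=0, a label missing from one side is treated as 0
# (assumption: missing inventory means nothing on hand)
filled = prices.mul(inventory, fill_value=0)

print(filled)
```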

Comparison operators on a Series will return a boolean Series, which is useful for subsetting.

In [18]:
inventory > 20

pear      False
banana     True
mango      True
apple      True
dtype: bool

Here are a couple of ways to accomplish a subsetting operation using this behavior:

In [19]:
inventory.loc[inventory > 20]

banana    50
mango     41
apple     22
dtype: int64

In [24]:
excess_inv = inventory > 40

In [26]:
inventory.loc[excess_inv]

banana    50
mango     41
dtype: int64
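Boolean Series can also be combined before subsetting. A minimal sketch, using `&` for "and" and `~` for "not" (each comparison needs its own parentheses because of operator precedence); the names `mid_stock` and `low_stock` are illustrative.

```python
import pandas as pd

inventory = pd.Series([10, 50, 41, 22],
                      index=['pear', 'banana', 'mango', 'apple'])

# Items with more than 20 but fewer than 50 units on hand
mid_stock = inventory.loc[(inventory > 20) & (inventory < 50)]

# Items that do NOT have more than 20 units on hand
low_stock = inventory.loc[~(inventory > 20)]

print(mid_stock)
print(low_stock)
```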

### Useful Series methods

Pandas has many methods for the Series object, including some that are especially useful for data exploration and basic statistical calculations:

`.mean()`        --> calculates arithmetic average

`.std()`         --> calculates standard deviation

`.median()`      --> finds median

`.describe()`    --> calculates summary statistics


In [27]:
prices.mean()

2.4

In [28]:
inventory.std()

18.09926333675858

In [29]:
prices.median()

2.0

In [30]:
inventory.describe()

count     4.000000
mean     30.750000
std      18.099263
min      10.000000
25%      19.000000
50%      31.500000
75%      43.250000
max      50.000000
dtype: float64

### Iteration

* Conventional iteration is possible, but not advised.
* To apply a function or operation to each element of a Pandas Series, use the `.apply` method, possibly in combination with a `lambda` function.

In [33]:
disc_prices = prices.apply(lambda x: 0.9 * x if x > 3 else x)

In [34]:
disc_prices

apple        1.0
pear         1.0
banana       2.0
mango        3.0
jackfruit    4.5
dtype: float64
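For a simple conditional like this one, a vectorized alternative to `.apply` is `Series.where`, which keeps values where the condition holds and substitutes the second argument elsewhere. This is offered as a sketch of another option rather than a replacement for the approach above; the names `applied` and `vectorized` are illustrative.

```python
import pandas as pd

prices = pd.Series([1, 1, 2, 3, 5],
                   index=['apple', 'pear', 'banana', 'mango', 'jackfruit'])

# Element-by-element version, as in the cell above
applied = prices.apply(lambda x: 0.9 * x if x > 3 else x)

# Vectorized version: keep prices <= 3, otherwise use the discounted price
vectorized = prices.where(prices <= 3, prices * 0.9)

print(vectorized)
```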

pandas Series documentation:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html

#### Pandas Series Challenges:

In [74]:
"""
    Using .iloc, select the items in inventory at indices 1 and 3.
    (This is possible, but may take some creative thinking.)
    Select the items with prices of less than the mean price.
    Return the total value of mangoes on hand.
"""


import pandas as pd
prices = pd.Series([1,1,2,3,5],
              index=['apple', 'pear', 'banana', 'mango', 'jackfruit'])

inventory = pd.Series([10, 50, 41, 22],
              index=['pear', 'banana', 'mango', 'apple'])


one_ind = inventory.iloc[1::2]
two_less = prices.loc[prices < prices.mean()]
three_onhand = (prices['mango'] * inventory['mango'])

In [75]:
print(one_ind)

banana    50
apple     22
dtype: int64


In [42]:
prices.loc[prices < prices.mean()]

apple     1
pear      1
banana    2
dtype: int64

In [43]:
mango_val = (prices['mango'] * inventory['mango'])

In [44]:
mango_val

123

In [45]:
inventory.iloc()

<pandas.core.indexing._iLocIndexer at 0x7fb1c007fc78>

In [76]:
inventory.iloc[1]


50

In [77]:
inventory.iloc[3]

22

In [64]:
inventory.index[1]

'banana'

### DataFrames

Analogy: vectors are to Series as matrices are to DataFrames.

DataFrames can be created from a variety of sources, including nested lists, dictionaries, NumPy arrays, Series, Excel spreadsheets, etc.
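As a quick illustration of a few of these sources, the sketch below builds the same small table from a nested list, from a dictionary of lists, and from a NumPy array. The values and the names `from_lists`, `from_dict`, and `from_array` are made up for the example.

```python
import numpy as np
import pandas as pd

labels = ['apple', 'banana']
columns = ['price', 'inventory']

# From a nested list: one inner list per row
from_lists = pd.DataFrame([[1, 22], [2, 50]], index=labels, columns=columns)

# From a dictionary: one key per column
from_dict = pd.DataFrame({'price': [1, 2], 'inventory': [22, 50]},
                         index=labels)

# From a 2-D NumPy array
from_array = pd.DataFrame(np.array([[1, 22], [2, 50]]),
                          index=labels, columns=columns)

print(from_dict)
```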

Below we instantiate a DataFrame from a dictionary based on the Series we used earlier.

In [78]:
produce = pd.DataFrame({'price': prices, 'disc_price': disc_prices,
                       'inventory': inventory})

In [79]:
produce

Unnamed: 0,price,disc_price,inventory
apple,1,1.0,22.0
banana,2,2.0,50.0
jackfruit,5,4.5,
mango,3,3.0,41.0
pear,1,1.0,10.0


In [82]:
# access a single value by passing row name and column name

produce.loc['pear','price']

1

In [89]:
# slice rows by numeric indexes

produce.iloc[2:, [0,2]]


Unnamed: 0,price,inventory
jackfruit,5,
mango,3,41.0
pear,1,10.0


In [95]:
# slice returning all columns for specific rows

produce.iloc[[0, 2, 4], :]



Unnamed: 0,price,disc_price,inventory
apple,1,1.0,22.0
jackfruit,5,4.5,
pear,1,1.0,10.0


In [96]:
# select an entire column based on name

produce['disc_price']

apple        1.0
banana       2.0
jackfruit    4.5
mango        3.0
pear         1.0
Name: disc_price, dtype: float64

In [100]:
# select all rows and return only columns with a max value
# greater than 5

produce.loc[:, produce.max() > 5]

Unnamed: 0,inventory
apple,22.0
banana,50.0
jackfruit,
mango,41.0
pear,10.0


In [101]:
# slice rows where price equals 1 and return all columns

produce.loc[produce['price'] == 1, :]

Unnamed: 0,price,disc_price,inventory
apple,1,1.0,22.0
pear,1,1.0,10.0


### Adding and Removing Columns

To create a new column, state the desired column name and assign a Series (or other compatible data) to it.




In [104]:
produce['inventory_val'] = produce['inventory'] * produce['price']

In [105]:
produce

Unnamed: 0,price,disc_price,inventory,inventory_val
apple,1,1.0,22.0,22.0
banana,2,2.0,50.0,100.0
jackfruit,5,4.5,,
mango,3,3.0,41.0,123.0
pear,1,1.0,10.0,10.0


Removing columns can be accomplished with the `.drop` method, but requires a couple of extra arguments.

The `.drop` method works on **rows** by default. To make it work on columns, we must specify an `axis` argument (rows: axis=0, columns: axis=1).

The default behavior is also that `.drop` creates a *new* DataFrame or Series. In order to actually drop the row or column from an existing DataFrame it is necessary to set the `inplace` argument to **True**.

In [106]:
produce.drop('inventory_val', axis = 1, inplace = True)

In [107]:
produce

Unnamed: 0,price,disc_price,inventory
apple,1,1.0,22.0
banana,2,2.0,50.0
jackfruit,5,4.5,
mango,3,3.0,41.0
pear,1,1.0,10.0
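For comparison, here is a sketch of `.drop` without `inplace=True`: the original DataFrame is left untouched and the result has to be assigned to a name. The `columns=` keyword shown is an alternative spelling of `axis=1`. The small frame and the variable names are illustrative.

```python
import pandas as pd

produce = pd.DataFrame({'price': [1, 2], 'inventory': [22, 50]},
                       index=['apple', 'banana'])

# Returns a new DataFrame; produce itself still has the column
without_inventory = produce.drop('inventory', axis=1)

# Equivalent spelling using the columns keyword
same_result = produce.drop(columns='inventory')

# Default behavior (axis=0): drop a row by its label
without_apple = produce.drop('apple')

print(produce.columns.tolist())
print(without_inventory.columns.tolist())
```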


### Updating Values

Values in the DataFrame can be updated via explicit assignment.

In the example below, all fruit with a price >= 3 has its price reset to 2.50.

In [109]:
produce.loc[produce.price>=3, 'price'] = 2.50


In [110]:
produce

Unnamed: 0,price,disc_price,inventory
apple,1.0,1.0,22.0
banana,2.0,2.0,50.0
jackfruit,2.5,4.5,
mango,2.5,3.0,41.0
pear,1.0,1.0,10.0


#### DataFrame Challenge Problem

In [117]:
"""
    Select all information from produce for pear and jackfruit
    Create a clearance_price column in produce which includes prices reduced from the original prices by 50%. Then set two_clearance equal to the entire row at index 3.
"""

import pandas as pd

prices = pd.Series([1,1,2,3,5],
              index=['apple', 'pear', 'banana', 'mango', 'jackfruit'])

inventory = pd.Series([10, 50, 41, 22],
              index=['pear', 'banana', 'mango', 'apple'])

discount_prices = prices.apply(lambda x: .9*x if x>3 else x)

produce = pd.DataFrame({'price':prices,
                        'discount_price':discount_prices,
                        'inventory':inventory})



one_select = produce.loc[['pear','jackfruit'],:]
produce['clearance_price'] = produce['price'] * 0.5

two_clearance = produce.iloc[3,:]





In [118]:
print(one_select)

           price  discount_price  inventory
pear           1             1.0       10.0
jackfruit      5             4.5        NaN


In [119]:
produce


Unnamed: 0,price,discount_price,inventory,clearance_price
apple,1,1.0,22.0,0.5
banana,2,2.0,50.0,1.0
jackfruit,5,4.5,,2.5
mango,3,3.0,41.0,1.5
pear,1,1.0,10.0,0.5


In [120]:
two_clearance

price               3.0
discount_price      3.0
inventory          41.0
clearance_price     1.5
Name: mango, dtype: float64

## Merging DataFrames

It is often useful to combine separate, but related, datasets. This could be accomplished in SQL with a `JOIN` query, but it may be more convenient and efficient to do it directly with Pandas.


### Concatenating DataFrames

The most basic way to combine DataFrames is by essentially stacking them, using `pd.concat`.

The obvious use case here is to combine datasets that include different observations but have the same categories/column names.

In [121]:
# Sample data

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3'],
                        'C': ['C0', 'C1', 'C2', 'C3'],
                        'D': ['D0', 'D1', 'D2', 'D3']},
                        index=[0, 1, 2, 3])

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                        'B': ['B4', 'B5', 'B6', 'B7'],
                        'C': ['C4', 'C5', 'C6', 'C7'],
                        'D': ['D4', 'D5', 'D6', 'D7']},
                        index=[4, 5, 6, 7])

df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                        'B': ['B8', 'B9', 'B10', 'B11'],
                        'C': ['C8', 'C9', 'C10', 'C11'],
                        'D': ['D8', 'D9', 'D10', 'D11']},
                        index=[8, 9, 10, 11])


In [122]:
# create a list of the DataFrames to be concatenated

frames = [df1, df2, df3]

In [123]:
# assign the new df a name and pass the list to the pd.concat method

result = pd.concat(frames)

In [124]:
result

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7
8,A8,B8,C8,D8
9,A9,B9,C9,D9
10,A10,B10,C10,D10
11,A11,B11,C11,D11


In [126]:
# it is also possible to concatenate DataFrames horizontally,
# though in our example above the outcome would be unsatisfactory

result2 = pd.concat(frames, axis=1)

In [127]:
result2

Unnamed: 0,A,B,C,D,A.1,B.1,C.1,D.1,A.2,B.2,C.2,D.2
0,A0,B0,C0,D0,,,,,,,,
1,A1,B1,C1,D1,,,,,,,,
2,A2,B2,C2,D2,,,,,,,,
3,A3,B3,C3,D3,,,,,,,,
4,,,,,A4,B4,C4,D4,,,,
5,,,,,A5,B5,C5,D5,,,,
6,,,,,A6,B6,C6,D6,,,,
7,,,,,A7,B7,C7,D7,,,,
8,,,,,,,,,A8,B8,C8,D8
9,,,,,,,,,A9,B9,C9,D9
10,,,,,,,,,A10,B10,C10,D10
11,,,,,,,,,A11,B11,C11,D11
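Horizontal concatenation tends to be more useful when the frames share an index but hold different columns. A minimal sketch with two hypothetical frames, `prices_df` and `stock_df`:

```python
import pandas as pd

prices_df = pd.DataFrame({'price': [1, 2, 3]},
                         index=['apple', 'banana', 'mango'])
stock_df = pd.DataFrame({'inventory': [22, 50, 41]},
                        index=['apple', 'banana', 'mango'])

# Rows are matched on the shared index, so no NaN values are introduced
side_by_side = pd.concat([prices_df, stock_df], axis=1)

print(side_by_side)
```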


### Merging DataFrames

To accomplish SQL-style joins, `pd.merge` is a very flexible approach.

In [129]:
# Sample data

left = pd.DataFrame({'key': ['dog', 'cat', 'fish', 'bird'],
                         'A': ['A0', 'A1', 'A2', 'A3'],
                         'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key': ['bird', 'fish', 'cat', 'dog'],
                          'C': ['C0', 'C1', 'C2', 'C3'],
                          'D': ['D0', 'D1', 'D2', 'D3']})

In [130]:
new_df = pd.merge(left, right, on='key')


In [131]:
new_df


Unnamed: 0,key,A,B,C,D
0,dog,A0,B0,C3,D3
1,cat,A1,B1,C2,D2
2,fish,A2,B2,C1,D1
3,bird,A3,B3,C0,D0


Joins using multiple keys are also possible, as in the following example:

In [132]:
# sample data

cities1 = pd.DataFrame({'city': ['Springfield', 'Springfield',
                                  'Dover', 'Chicago'],
                         'state': ['IL', 'OH', 'DE', 'IL'],
                         'A': ['A0', 'A1', 'A2', 'A3'],
                         'B': ['B0', 'B1', 'B2', 'B3']})

cities2 = pd.DataFrame({'city': ['Cleveland', 'Dover',
                                   'Springfield', 'Chicago'],
                          'state': ['OH', 'NH', 'IL', 'IL'],
                          'C': ['C0', 'C1', 'C2', 'C3'],
                          'D': ['D0', 'D1', 'D2', 'D3']})

In [133]:
merged_cities = pd.merge(cities1, cities2, on=['city', 'state'])

In [134]:
merged_cities

Unnamed: 0,city,state,A,B,C,D
0,Springfield,IL,A0,B0,C2,D2
1,Chicago,IL,A3,B3,C3,D3


**IMPORTANT**
`pd.merge` defaults to an *inner* join, but it is possible to control this behavior with the `how` argument.

`how` --> defaults to `'inner'`, but `'left'`, `'right'`, and `'outer'` are also available.

| Merge method | SQL JOIN name    | Description                               |
|--------------|------------------|-------------------------------------------|
| left         | LEFT OUTER JOIN  | Use keys from left frame only             |
| right        | RIGHT OUTER JOIN | Use keys from right frame only            |
| outer        | FULL OUTER JOIN  | Use union of keys from both frames        |
| inner        | INNER JOIN       | Use intersection of keys from both frames |
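The effect of each `how` option can be seen by merging two small frames whose keys only partly overlap. The frames below are made up for this sketch, and `results` is simply a dictionary that collects one merged frame per option.

```python
import pandas as pd

left = pd.DataFrame({'key': ['dog', 'cat', 'fish'],
                     'A': ['A0', 'A1', 'A2']})
right = pd.DataFrame({'key': ['cat', 'fish', 'bird'],
                      'C': ['C0', 'C1', 'C2']})

# Run the same merge with each join type and keep the results
results = {how: pd.merge(left, right, on='key', how=how)
           for how in ['inner', 'left', 'right', 'outer']}

for how, merged in results.items():
    # Unmatched rows show up with NaN in the other frame's columns
    print(how, len(merged), sorted(merged['key']))
```

Only `cat` and `fish` appear in both frames, so the inner join has two rows, the left and right joins have three each, and the outer join has four.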

### Pandas Merge Challenge


In [5]:
'''Join the DataFrames below to return a new DataFrame of users with
listed birthdays, along with their addresses if you have them.'''



import pandas as pd
dobs = pd.DataFrame({'name': ['Suzy', 'Wei','Yulia', 'Arvind'],
                   'day': ['12', '19', '2', '23'],
                   'month': ['Dec', 'Nov', 'May', 'Jul']})

addresses = pd.DataFrame({'name': ['Marisol', 'Arvind','Stephan', 'Suzy'],
                     'city': ['San Francisco', 'Denver', 'Austin', 'Seattle'],
                     'state': ['CA', 'CO', 'TX', 'WA']})


birthday_address = pd.merge(dobs, addresses, on='name',how='left')

In [2]:
dobs


Unnamed: 0,name,day,month
0,Suzy,12,Dec
1,Wei,19,Nov
2,Yulia,2,May
3,Arvind,23,Jul


In [3]:
addresses

Unnamed: 0,name,city,state
0,Marisol,San Francisco,CA
1,Arvind,Denver,CO
2,Stephan,Austin,TX
3,Suzy,Seattle,WA


In [6]:
birthday_address


Unnamed: 0,name,day,month,city,state
0,Suzy,12,Dec,Seattle,WA
1,Wei,19,Nov,,
2,Yulia,2,May,,
3,Arvind,23,Jul,Denver,CO


### Split, Apply and Combine Data

Subsetting datasets can be useful to get a better understanding of the data during EDA and will likely be necessary during analysis.

This section covers the strategy known as *split-apply-combine*.

##### Overview of Split-Apply-Combine:
1. split data into subsets
2. apply some calculation(s) to each of those subsets
3. combine the results into a new dataset
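The three steps can be written out by hand to show what is happening. The sketch below is for illustration only, using a cut-down version of the grocery data introduced later in this section; the names `pieces`, `means`, `manual`, and `shortcut` are assumptions of this example. In practice the one-line `groupby` form at the end does the same work.

```python
import pandas as pd

grocery = pd.DataFrame({'category': ['produce', 'produce', 'meat',
                                     'meat', 'meat', 'cheese', 'cheese'],
                        'price': [.99, .49, 1.89, 4.34, 9.50, 6.25, 8.0]})

# 1. split: one sub-DataFrame per category
pieces = {name: group for name, group in grocery.groupby('category')}

# 2. apply: compute the mean price of each piece
means = {name: group['price'].mean() for name, group in pieces.items()}

# 3. combine: gather the per-group results into one Series
manual = pd.Series(means, name='price')

# The same result in a single step
shortcut = grocery.groupby('category')['price'].mean()

print(manual)
```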


#### Splitting data

The tool for doing this in Pandas is the `.groupby()` method.

This method is similar to a SQL `GROUP BY` statement, but has more flexibility and features.

The `.groupby` method returns a special `DataFrameGroupBy` object which has a number of useful methods of its own.

* `aggregate` returns a single value for each group, so it is useful for rolling groups up into summary statistics (e.g. mean). It iterates over *groups* and performs the operation on each group as a whole.


* `transform` applies a calculation within each group and returns a result with the same shape and index as the input, so each observation gets its own value (e.g. its deviation from the group mean).


* `filter` is for filtering *by group*. For example, if you wanted to eliminate groups with a small number of observations, you would first use `.groupby` to split the DataFrame into groups, then use `filter` to remove groups that do not meet the minimum threshold. Every row of each group that passes is returned by `filter`.


* `apply` is a general catch-all that places no constraints on the type of data returned.
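Since `apply` is the most open-ended of the four, here is a brief sketch of it returning two different shapes of result: one scalar per group, and one small Series per group. The data is a cut-down version of the grocery example used in this section, and the names `price_range` and `top_price` are illustrative.

```python
import pandas as pd

grocery = pd.DataFrame({'category': ['produce', 'produce', 'meat',
                                     'meat', 'meat', 'cheese', 'cheese'],
                        'price': [.99, .49, 1.89, 4.34, 9.50, 6.25, 8.0]})

grouped_price = grocery.groupby('category')['price']

# A scalar per group: the spread between the highest and lowest price
price_range = grouped_price.apply(lambda s: s.max() - s.min())

# A Series per group: the single highest price, keeping its original row label
top_price = grouped_price.apply(lambda s: s.nlargest(1))

print(price_range)
print(top_price)
```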

In [9]:
# example data

grocery = pd.DataFrame({'category':['produce', 'produce', 'meat',
                                        'meat', 'meat', 'cheese', 'cheese'],
                            'item':['celery', 'apple', 'ham', 'turkey',
                                    'lamb', 'cheddar', 'brie'],
                            'price':[.99, .49, 1.89, 4.34, 9.50, 6.25, 8.0]})

In [10]:
grocery


Unnamed: 0,category,item,price
0,produce,celery,0.99
1,produce,apple,0.49
2,meat,ham,1.89
3,meat,turkey,4.34
4,meat,lamb,9.5
5,cheese,cheddar,6.25
6,cheese,brie,8.0


In [11]:
grocery.groupby('category')

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x7f4989266518>

In [12]:
grouped = grocery.groupby('category')

In [14]:
import numpy as np

In [15]:
# select the numeric column; newer pandas may raise on the string column 'item'
grouped[['price']].aggregate('mean')

Unnamed: 0_level_0,price
category,Unnamed: 1_level_1
cheese,7.125
meat,5.243333
produce,0.74


**category** becomes the index of the above object

In [16]:
# restrict to 'price': subtraction is undefined for the string column 'item'
grouped[['price']].transform(lambda x: x - x.mean())

Unnamed: 0,price
0,0.25
1,-0.25
2,-3.353333
3,-0.903333
4,4.256667
5,-0.875
6,0.875


In [None]:
# The index above is the same as in the original df

In [21]:
# to filter the dataset by the number of entries in each group
# you can call the len() function

grouped.filter(lambda x: len(x)>2)

Unnamed: 0,category,item,price
2,meat,ham,1.89
3,meat,turkey,4.34
4,meat,lamb,9.5


In [18]:
grouped.filter(lambda x: len(x)<=2)

Unnamed: 0,category,item,price
0,produce,celery,0.99
1,produce,apple,0.49
5,cheese,cheddar,6.25
6,cheese,brie,8.0


If you want to re-index a subgroup, you can use the `.reset_index()` method, which re-indexes the subgroup and moves the original index into a regular column.

In [22]:
grouped.filter(lambda x: len(x)<=2).reset_index()

Unnamed: 0,index,category,item,price
0,0,produce,celery,0.99
1,1,produce,apple,0.49
2,5,cheese,cheddar,6.25
3,6,cheese,brie,8.0
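If the original index is not needed, passing `drop=True` to `.reset_index()` discards it instead of keeping it as a column. A minimal sketch, rebuilding the grocery data so that it runs on its own; `small` and `renumbered` are illustrative names.

```python
import pandas as pd

grocery = pd.DataFrame({'category': ['produce', 'produce', 'meat',
                                     'meat', 'meat', 'cheese', 'cheese'],
                        'item': ['celery', 'apple', 'ham', 'turkey',
                                 'lamb', 'cheddar', 'brie'],
                        'price': [.99, .49, 1.89, 4.34, 9.50, 6.25, 8.0]})

# Keep only the categories with at most two items
small = grocery.groupby('category').filter(lambda g: len(g) <= 2)

# Renumber the rows from 0 and discard the old index
renumbered = small.reset_index(drop=True)

print(renumbered)
```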


### Pandas Groupby Challenge

Use *split-apply-combine* in order to:

1. Remove all items in categories where the mean price in that category is less than $3.00.
    
2. Find the maximum values in each category for all features. (What does Pandas take to be the maximum value of the 'item' column?)

3. If the maximum price in a category is more than $3.00, reduce all prices in that category by 10%. Return a Series of the new price column.


In [24]:
import pandas as pd
import numpy as np

grocery = pd.DataFrame({'category':['produce', 'produce', 'meat',
                                    'meat', 'meat', 'cheese', 'cheese'],
                        'item':['celery', 'apple', 'ham', 'turkey',  'lamb',
                                'cheddar', 'brie'],
                        'price':[.99, .49, 1.89, 4.34, 9.50, 6.25, 8.0]})

grouped = grocery.groupby('category')

one_mean = grouped.filter(lambda x: x['price'].mean() >= 3)

two_max = grouped.aggregate(lambda x: max(x))

three_round = grocery.groupby('category')["price"].transform(lambda x: (x * 0.9) if np.max(x) > 3 else x)

In [25]:
grocery

Unnamed: 0,category,item,price
0,produce,celery,0.99
1,produce,apple,0.49
2,meat,ham,1.89
3,meat,turkey,4.34
4,meat,lamb,9.5
5,cheese,cheddar,6.25
6,cheese,brie,8.0


In [29]:
grouped = grocery.groupby('category')


In [33]:
grouped.filter(lambda x: x['price'].mean() >= 3)

Unnamed: 0,category,item,price
2,meat,ham,1.89
3,meat,turkey,4.34
4,meat,lamb,9.5
5,cheese,cheddar,6.25
6,cheese,brie,8.0


In [47]:
grouped.aggregate(lambda x: max(x))

Unnamed: 0_level_0,item,price
category,Unnamed: 1_level_1,Unnamed: 2_level_1
cheese,cheddar,8.0
meat,turkey,9.5
produce,celery,0.99


In [49]:
grocery.groupby('category')["price"].transform(lambda x: (x * 0.9) if np.max(x) > 3 else x)

0    0.990
1    0.490
2    1.701
3    3.906
4    8.550
5    5.625
6    7.200
Name: price, dtype: float64

In [51]:
# to check results
cat_max = grocery.groupby('category')["price"].aggregate(lambda x: np.max(x))

In [52]:
cat_max

category
cheese     8.00
meat       9.50
produce    0.99
Name: price, dtype: float64