In [2]:
import pandas as pd
import numpy as np

Pandas `DataFrames` are the Python equivalent of an Excel or SQL table; we'll use them to store and analyze data

 <img src="../images/df-1_objectives.png">

## 1 - DataFrame Basics

### DataFrame Properties

 <img src="../images/df-2_properties.png">

### Creating a DataFrame

* You can `create a DataFrame` from a Python dictionary or NumPy array by using the Pandas `DataFrame()` function 
* You'll most likely `create a DataFrame` by reading in a flat file (csv, txt, or tsv) with Pandas `read_csv()` function
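
As a minimal sketch of the first option, the cell below builds one DataFrame from a dictionary and one from a NumPy array. The values and the names `prices_dict`, `df_from_dict`, `arr` and `df_from_array` are made up for illustration and are not part of the course data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, loosely modeled on the oil price data
prices_dict = {
    "date": ["2013-01-02", "2013-01-03", "2013-01-04"],
    "price": [93.14, 92.97, 93.12],
}

# Dictionary keys become column names, values become the column data
df_from_dict = pd.DataFrame(prices_dict)

# With a 2-D NumPy array, column names are supplied separately
arr = np.array([[1, 2], [3, 4], [5, 6]])
df_from_array = pd.DataFrame(arr, columns=["a", "b"])

print(df_from_dict)
print(df_from_array)
```

Both calls produce a default `RangeIndex`, the same kind of index `read_csv()` creates when no index column is specified.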

In [2]:
oil = pd.read_csv("../retail/oil.csv")

oil

Unnamed: 0,date,dcoilwtico
0,2013-01-01,
1,2013-01-02,93.14
2,2013-01-03,92.97
3,2013-01-04,93.12
4,2013-01-07,93.20
...,...,...
1213,2017-08-25,47.65
1214,2017-08-28,46.40
1215,2017-08-29,46.46
1216,2017-08-30,45.96


In [3]:
oil.shape

(1218, 2)

In [4]:
oil.index 

RangeIndex(start=0, stop=1218, step=1)

In [5]:
oil.columns

Index(['date', 'dcoilwtico'], dtype='object')

In [6]:
# we can assign column names to the dataframe

oil.columns = ['price_date', 'oil_price']

In [7]:
oil.head()

Unnamed: 0,price_date,oil_price
0,2013-01-01,
1,2013-01-02,93.14
2,2013-01-03,92.97
3,2013-01-04,93.12
4,2013-01-07,93.2


In [8]:
oil.axes 

[RangeIndex(start=0, stop=1218, step=1),
 Index(['price_date', 'oil_price'], dtype='object')]

In [9]:
oil.dtypes 

price_date     object
oil_price     float64
dtype: object

## 2 - Exploring DataFrames

You can `explore a DataFrame` with these Pandas methods:

 <img src="../images/df-3_explore.png">

 Examples: 
 * `retail_df.head()` - returns the first 5 rows by default
 * `retail_df.tail(3)` - useful for time series data, since it shows the last dates in the data
 * `retail_df.sample()` - returns 1 row by default
 * `retail_df.sample(5, random_state=12345)`
 * `retail_df.info()` - shows the shape and column metadata (dtypes, non-null counts, memory usage)
 * `retail_df.info(show_counts=True)`
 * `retail_df.describe()` 
 * `retail_df.describe(include="all").round()`
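
The methods listed above can be tried on a tiny stand-in DataFrame. This is only a sketch: `retail_df` here is a hypothetical four-row table, not the real retail file.

```python
import pandas as pd

# Hypothetical stand-in for the retail data
retail_df = pd.DataFrame({
    "date": ["2016-01-01", "2016-01-02", "2016-01-03", "2016-01-04"],
    "family": ["BOOKS", "DAIRY", "BOOKS", "EGGS"],
    "sales": [0.0, 12.5, 3.0, None],
})

first_rows = retail_df.head(2)                          # first 2 rows
last_rows = retail_df.tail(3)                           # last 3 rows
random_rows = retail_df.sample(2, random_state=12345)   # reproducible random sample
summary = retail_df.describe(include="all")             # numeric and non-numeric columns

retail_df.info(show_counts=True)                        # prints dtypes and non-null counts
print(summary)
```

Setting `random_state` makes the sample repeatable, which helps when sharing a notebook with others.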

In [3]:
oil = pd.read_csv("../retail/oil.csv")

oil.head(10)

Unnamed: 0,date,dcoilwtico
0,2013-01-01,
1,2013-01-02,93.14
2,2013-01-03,92.97
3,2013-01-04,93.12
4,2013-01-07,93.2
5,2013-01-08,93.21
6,2013-01-09,93.08
7,2013-01-10,93.81
8,2013-01-11,93.6
9,2013-01-14,94.27


In [6]:
# the DataFrame has fewer rows than pandas' display.max_info_rows limit (about 1.7 million), so non-null counts are displayed by default
# there's some missing data in the dcoilwtico column (1175 non-null out of 1218)

oil.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1218 entries, 0 to 1217
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   date        1218 non-null   object 
 1   dcoilwtico  1175 non-null   float64
dtypes: float64(1), object(1)
memory usage: 19.2+ KB


In [7]:
oil.describe()

Unnamed: 0,dcoilwtico
count,1175.0
mean,67.714366
std,25.630476
min,26.19
25%,46.405
50%,53.19
75%,95.66
max,110.62


In [8]:
oil.describe(include = 'all') # all the dates are unique

Unnamed: 0,date,dcoilwtico
count,1218,1175.0
unique,1218,
top,2013-01-01,
freq,1,
mean,,67.714366
std,,25.630476
min,,26.19
25%,,46.405
50%,,53.19
75%,,95.66


## 3 - Accessing & Dropping Data


 <img src="../images/df-5_accessing_series_operations.png">

In [9]:
oil = pd.read_csv("../retail/oil.csv")

oil.columns = ['date', 'oil price']

oil.head()

Unnamed: 0,date,oil price
0,2013-01-01,
1,2013-01-02,93.14
2,2013-01-03,92.97
3,2013-01-04,93.12
4,2013-01-07,93.2


In [10]:
oil.date 

0       2013-01-01
1       2013-01-02
2       2013-01-03
3       2013-01-04
4       2013-01-07
           ...    
1213    2017-08-25
1214    2017-08-28
1215    2017-08-29
1216    2017-08-30
1217    2017-08-31
Name: date, Length: 1218, dtype: object

In [11]:
oil["oil price"]

0         NaN
1       93.14
2       92.97
3       93.12
4       93.20
        ...  
1213    47.65
1214    46.40
1215    46.46
1216    45.96
1217    47.26
Name: oil price, Length: 1218, dtype: float64

In [12]:
oil.columns = ['date', 'price']

oil.price

0         NaN
1       93.14
2       92.97
3       93.12
4       93.20
        ...  
1213    47.65
1214    46.40
1215    46.46
1216    45.96
1217    47.26
Name: price, Length: 1218, dtype: float64

In [14]:
oil[['date', 'price']] # double brackets to get a dataframe

Unnamed: 0,date,price
0,2013-01-01,
1,2013-01-02,93.14
2,2013-01-03,92.97
3,2013-01-04,93.12
4,2013-01-07,93.20
...,...,...
1213,2017-08-25,47.65
1214,2017-08-28,46.40
1215,2017-08-29,46.46
1216,2017-08-30,45.96


### Accessing Data with `.iloc[]` and `.loc[]`

 <img src="../images/df-4_accessing_loc.png">

In [15]:
oil = pd.read_csv("../retail/oil.csv")

oil.columns = ['date', 'price']

oil['euro_price'] = oil['price'] * 1.1 

oil.head()
    

Unnamed: 0,date,price,euro_price
0,2013-01-01,,
1,2013-01-02,93.14,102.454
2,2013-01-03,92.97,102.267
3,2013-01-04,93.12,102.432
4,2013-01-07,93.2,102.52


In [16]:
# access the first 10 rows of the dataframe

oil.iloc[:10, :] # equivalent to oil.iloc[:10]

Unnamed: 0,date,price,euro_price
0,2013-01-01,,
1,2013-01-02,93.14,102.454
2,2013-01-03,92.97,102.267
3,2013-01-04,93.12,102.432
4,2013-01-07,93.2,102.52
5,2013-01-08,93.21,102.531
6,2013-01-09,93.08,102.388
7,2013-01-10,93.81,103.191
8,2013-01-11,93.6,102.96
9,2013-01-14,94.27,103.697


In [18]:
# just grab the last column
# wrap that in brackets to get a dataframe

oil.iloc[:10, [-1]]

Unnamed: 0,euro_price
0,
1,102.454
2,102.267
3,102.432
4,102.52
5,102.531
6,102.388
7,103.191
8,102.96
9,103.697


In [19]:
# slicing works exactly like slicing any object in Python
# grab the second last column to the end

oil.iloc[:10, -2:]

Unnamed: 0,price,euro_price
0,,
1,93.14,102.454
2,92.97,102.267
3,93.12,102.432
4,93.2,102.52
5,93.21,102.531
6,93.08,102.388
7,93.81,103.191
8,93.6,102.96
9,94.27,103.697


In [20]:
oil.loc[:5] # stop point is going to be inclusive when doing label based indexing

Unnamed: 0,date,price,euro_price
0,2013-01-01,,
1,2013-01-02,93.14,102.454
2,2013-01-03,92.97,102.267
3,2013-01-04,93.12,102.432
4,2013-01-07,93.2,102.52
5,2013-01-08,93.21,102.531


In [21]:
oil.loc[:5, ['date', 'euro_price']]

Unnamed: 0,date,euro_price
0,2013-01-01,
1,2013-01-02,102.454
2,2013-01-03,102.267
3,2013-01-04,102.432
4,2013-01-07,102.52
5,2013-01-08,102.531


In [22]:
# the order will matter 

oil.loc[:5, ['euro_price', 'date']]

Unnamed: 0,euro_price,date
0,,2013-01-01
1,102.454,2013-01-02
2,102.267,2013-01-03
3,102.432,2013-01-04
4,102.52,2013-01-07
5,102.531,2013-01-08


In [23]:
oil.loc[:5, "date":"euro_price"]

Unnamed: 0,date,price,euro_price
0,2013-01-01,,
1,2013-01-02,93.14,102.454
2,2013-01-03,92.97,102.267
3,2013-01-04,93.12,102.432
4,2013-01-07,93.2,102.52
5,2013-01-08,93.21,102.531


### Dropping Rows & Columns

<img src="../images/df-6_drop_1.png">

<img src="../images/df-7_drop_2.png">

In [32]:
oil = pd.read_csv("../retail/oil.csv")

oil.columns = ['date', 'price']

oil['euro_price'] = oil['price'] * 1.1 

oil.head()
    

Unnamed: 0,date,price,euro_price
0,2013-01-01,,
1,2013-01-02,93.14,102.454
2,2013-01-03,92.97,102.267
3,2013-01-04,93.12,102.432
4,2013-01-07,93.2,102.52


In [33]:
# useful to create intermediate dataframes and keep our original dataframe intact
# since we can always get back our original dataframe

oil_euro = oil.drop('price', axis = 1) # axis = 1 means we're dropping a column

In [34]:
oil_euro.head()

Unnamed: 0,date,euro_price
0,2013-01-01,
1,2013-01-02,102.454
2,2013-01-03,102.267
3,2013-01-04,102.432
4,2013-01-07,102.52


In [35]:
oil.head()

Unnamed: 0,date,price,euro_price
0,2013-01-01,,
1,2013-01-02,93.14,102.454
2,2013-01-03,92.97,102.267
3,2013-01-04,93.12,102.432
4,2013-01-07,93.2,102.52


In [36]:
# dropping rows 
# drop a single row

oil_euro = oil.drop(1, axis = 0) # axis = 0 means we're dropping rows

oil_euro # seeing nonconsecutive indices

Unnamed: 0,date,price,euro_price
0,2013-01-01,,
2,2013-01-03,92.97,102.267
3,2013-01-04,93.12,102.432
4,2013-01-07,93.20,102.520
5,2013-01-08,93.21,102.531
...,...,...,...
1213,2017-08-25,47.65,52.415
1214,2017-08-28,46.40,51.040
1215,2017-08-29,46.46,51.106
1216,2017-08-30,45.96,50.556


In [37]:
oil_euro.reset_index()

Unnamed: 0,index,date,price,euro_price
0,0,2013-01-01,,
1,2,2013-01-03,92.97,102.267
2,3,2013-01-04,93.12,102.432
3,4,2013-01-07,93.20,102.520
4,5,2013-01-08,93.21,102.531
...,...,...,...,...
1212,1213,2017-08-25,47.65,52.415
1213,1214,2017-08-28,46.40,51.040
1214,1215,2017-08-29,46.46,51.106
1215,1216,2017-08-30,45.96,50.556
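
As shown above, `reset_index()` keeps the old index as a new "index" column. If the old labels aren't needed, passing `drop=True` discards them. A small sketch, using a made-up three-row DataFrame named `df`:

```python
import pandas as pd

# Hypothetical three-row DataFrame
df = pd.DataFrame({"price": [93.14, 92.97, 93.12]})

dropped = df.drop(1, axis=0)                  # index is now 0, 2
renumbered = dropped.reset_index(drop=True)   # index is 0, 1 and no extra column is added

print(dropped)
print(renumbered)
```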


## 4 - Blank & Duplicate Values

### Identifying Duplicate Rows

The `.duplicated()` method `identifies duplicate rows` of data 
* Specify subset=column(s) to look for duplicates across a subset of columns
* example:

    ```python
    product_df.duplicated(subset='product')
    ```

### Dropping Duplicate Rows

The `.drop_duplicates()` method `drops duplicate rows` from a DataFrame
* Specify subset=column(s) to look for duplicates across a subset of columns
* example: 

    ```python
    product_df.drop_duplicates(subset='product', keep='last', ignore_index=True)
    ```
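
To see both methods side by side, here is a sketch on a hypothetical `product_df` (the products and prices are invented for illustration):

```python
import pandas as pd

# Hypothetical product table with repeated rows
product_df = pd.DataFrame({
    "product": ["Dairy", "Dairy", "Dairy", "Vegetables", "Fruits"],
    "price": [2.56, 2.56, 4.55, 2.74, 5.44],
})

# True where the whole row repeats an earlier row
full_dupes = product_df.duplicated()

# True where only the "product" value repeats an earlier row
product_dupes = product_df.duplicated(subset="product")

# Keep the last row for each product and renumber the index from 0
deduped = product_df.drop_duplicates(subset="product", keep="last", ignore_index=True)

print(full_dupes)
print(product_dupes)
print(deduped)
```

Note that `keep="last"` changes which price survives for "Dairy", so the choice of `keep` matters whenever the other columns differ.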

### Identifying Missing Data

You can `identify missing data` by column by chaining the `.isna()` and `.sum()` methods
* The `.info()` method can also help identify null values
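
A quick sketch of the `.isna().sum()` pattern, assuming a small made-up DataFrame called `oil_sample`:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with two missing prices
oil_sample = pd.DataFrame({
    "date": ["2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04"],
    "price": [np.nan, 93.14, np.nan, 93.12],
})

# .isna() returns a DataFrame of True/False; .sum() counts the True values per column
missing_per_column = oil_sample.isna().sum()

print(missing_per_column)
```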

### Handling Missing Data

<img src="../images/df-8_handling_missing_data.png">
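
The sketch below shows a few common ways to handle missing values: dropping rows, filling with a summary statistic, and forward filling. These are standard pandas options and may not match the slide exactly; `oil_sample` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with two missing prices
oil_sample = pd.DataFrame({
    "date": ["2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04"],
    "price": [np.nan, 93.14, np.nan, 93.12],
})

# Option 1: drop rows where price is missing
dropped = oil_sample.dropna(subset=["price"])

# Option 2: fill missing prices with the column mean
filled_mean = oil_sample.fillna({"price": oil_sample["price"].mean()})

# Option 3: carry the previous valid price forward (the first row has nothing to copy from)
filled_forward = oil_sample.assign(price=oil_sample["price"].ffill())

print(dropped)
print(filled_mean)
print(filled_forward)
```

Which option is appropriate depends on the analysis; forward filling often suits prices on non-trading days, but that is a judgment call.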

## 5 - Sorting & Filtering

### Filtering DataFrames

1 - You can `filter the rows in a DataFrame` by passing a logical test to the `.loc[]` accessor, just like filtering a Series or a NumPy array.

For example: 

```python
retail_df.loc[retail_df["date"] == '2016-10-28']
```

* _This filters the retail_df DataFrame and only returns rows where the date is equal to "2016-10-28"_



2 - You can `filter the columns in a DataFrame` by passing them into the `.loc[]` accessor as a list or a slice

For example: 

```python
retail_df.loc[retail_df["date"] == '2016-10-28', ["date", "sales"]].head()
```

* _This filters the retail_df DataFrame to the columns selected, and only returns rows where the date is equal to "2016-10-28"_

3 - You can `apply multiple filters` by joining the logical tests with the `&` operator 
* Try creating a Boolean mask for creating filters with complex logic

For example: 

<img src="../images/df-9_filtering.png">
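
The three ideas above (row filter, column selection, combined conditions) fit into a single `.loc[]` call. A minimal sketch, assuming a hypothetical four-row `retail_df`:

```python
import pandas as pd

# Hypothetical stand-in for the retail data
retail_df = pd.DataFrame({
    "date": ["2016-10-27", "2016-10-28", "2016-10-28", "2016-10-29"],
    "family": ["BOOKS", "DAIRY", "EGGS", "DAIRY"],
    "sales": [0.0, 120.0, 45.0, 98.0],
})

# Each condition is wrapped in parentheses because & binds tighter than == and >
mask = (retail_df["date"] == "2016-10-28") & (retail_df["sales"] > 100)

# Rows come from the Boolean mask, columns from the list
filtered = retail_df.loc[mask, ["date", "sales"]]

print(filtered)
```

Storing the mask in a variable keeps the `.loc[]` call short and lets you reuse or inspect the condition on its own.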




In [43]:
oil = pd.read_csv("../retail/oil.csv")

oil['benchmark'] = 100 

oil.head()

Unnamed: 0,date,dcoilwtico,benchmark
0,2013-01-01,,100
1,2013-01-02,93.14,100
2,2013-01-03,92.97,100
3,2013-01-04,93.12,100
4,2013-01-07,93.2,100


In [44]:
oil.describe()

Unnamed: 0,dcoilwtico,benchmark
count,1175.0,1218.0
mean,67.714366,100.0
std,25.630476,0.0
min,26.19,100.0
25%,46.405,100.0
50%,53.19,100.0
75%,95.66,100.0
max,110.62,100.0


In [45]:
oil.loc[oil['dcoilwtico'] > 100] 

Unnamed: 0,date,dcoilwtico,benchmark
131,2013-07-03,101.92,100
133,2013-07-05,103.09,100
134,2013-07-08,103.03,100
135,2013-07-09,103.46,100
136,2013-07-10,106.41,100
...,...,...,...
407,2014-07-24,102.76,100
408,2014-07-25,105.23,100
409,2014-07-28,105.68,100
410,2014-07-29,104.91,100


In [46]:
# compare across columns (against the benchmark column)

oil.loc[oil['dcoilwtico'] > oil['benchmark']]

Unnamed: 0,date,dcoilwtico,benchmark
131,2013-07-03,101.92,100
133,2013-07-05,103.09,100
134,2013-07-08,103.03,100
135,2013-07-09,103.46,100
136,2013-07-10,106.41,100
...,...,...,...
407,2014-07-24,102.76,100
408,2014-07-25,105.23,100
409,2014-07-28,105.68,100
410,2014-07-29,104.91,100


In [47]:
oil.loc[oil['date'].str[:4] == '2013']

Unnamed: 0,date,dcoilwtico,benchmark
0,2013-01-01,,100
1,2013-01-02,93.14,100
2,2013-01-03,92.97,100
3,2013-01-04,93.12,100
4,2013-01-07,93.20,100
...,...,...,...
256,2013-12-25,,100
257,2013-12-26,99.18,100
258,2013-12-27,99.94,100
259,2013-12-30,98.90,100


In [48]:
# combine our conditions and create a mask 

mask = ((oil['dcoilwtico'] > oil['benchmark']) & 
        (oil['date'].str[:4] == '2013'))

In [49]:
oil.loc[mask]

Unnamed: 0,date,dcoilwtico,benchmark
131,2013-07-03,101.92,100
133,2013-07-05,103.09,100
134,2013-07-08,103.03,100
135,2013-07-09,103.46,100
136,2013-07-10,106.41,100
...,...,...,...
204,2013-10-14,102.46,100
205,2013-10-15,101.15,100
206,2013-10-16,102.34,100
207,2013-10-17,100.72,100


In [50]:
# change to or

mask = ((oil['dcoilwtico'] > oil['benchmark']) |
        (oil['date'].str[:4] == '2013'))

In [51]:
oil.loc[mask]

Unnamed: 0,date,dcoilwtico,benchmark
0,2013-01-01,,100
1,2013-01-02,93.14,100
2,2013-01-03,92.97,100
3,2013-01-04,93.12,100
4,2013-01-07,93.20,100
...,...,...,...
407,2014-07-24,102.76,100
408,2014-07-25,105.23,100
409,2014-07-28,105.68,100
410,2014-07-29,104.91,100


### Filtering DataFrames

4 - **PRO TIP:** QUERY

The `.query()` method lets you use SQL-like syntax to filter DataFrames
* You can specify any number of filtering conditions by using the _`"and"`_ and _`"or"`_ keywords
* You can reference variables by using the _`"@"`_ symbol

<img src="../images/df-10_filtering_query_1.png">


<img src="../images/df-11_filtering_query_2.png">
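
Here is a small sketch of the `@` syntax for referencing Python variables inside `.query()`. The DataFrame `oil_sample` and the variables `price_cap` and `price_floor` are hypothetical names chosen for this example.

```python
import pandas as pd

# Hypothetical sample of oil prices
oil_sample = pd.DataFrame({
    "date": ["2013-07-03", "2013-12-30", "2014-07-24", "2017-08-30"],
    "dcoilwtico": [101.92, 98.90, 102.76, 45.96],
})

price_cap = 100
price_floor = 50

# @price_cap refers to the Python variable, not a column
expensive = oil_sample.query("dcoilwtico > @price_cap")

# Conditions are combined with the "and" / "or" keywords
extreme = oil_sample.query("dcoilwtico > @price_cap or dcoilwtico < @price_floor")

print(expensive)
print(extreme)
```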

In [54]:
oil = pd.read_csv("../retail/oil.csv", 
                  parse_dates = ['date']) # convert to datetime64 dtype

oil['benchmark'] = 100 

oil.head()

Unnamed: 0,date,dcoilwtico,benchmark
0,2013-01-01,,100
1,2013-01-02,93.14,100
2,2013-01-03,92.97,100
3,2013-01-04,93.12,100
4,2013-01-07,93.2,100


In [56]:
oil.dtypes

date          datetime64[ns]
dcoilwtico           float64
benchmark              int64
dtype: object

In [57]:
oil.query(
    'dcoilwtico > benchmark or date.dt.year == 2013'
)

Unnamed: 0,date,dcoilwtico,benchmark
0,2013-01-01,,100
1,2013-01-02,93.14,100
2,2013-01-03,92.97,100
3,2013-01-04,93.12,100
4,2013-01-07,93.20,100
...,...,...,...
407,2014-07-24,102.76,100
408,2014-07-25,105.23,100
409,2014-07-28,105.68,100
410,2014-07-29,104.91,100


### Sorting DataFrames

#### Sorting DataFrames By Indices

You can `sort a DataFrame by its indices` using the `.sort_index()` method
* This sorts rows (`axis=0`) by default, but you can specify `axis=1` to sort the columns 

<img src="../images/df-12_sorting_1_index.png">

<img src="../images/df-13_sorting_2_index.png">

#### Sorting DataFrames By Values

You can `sort a DataFrame by its values` using the `.sort_values()` method
* You can sort by a single column or by multiple columns

<img src="../images/df-14_sorting_3_values.png">

#### Sorting Practice

In [58]:
oil 

Unnamed: 0,date,dcoilwtico,benchmark
0,2013-01-01,,100
1,2013-01-02,93.14,100
2,2013-01-03,92.97,100
3,2013-01-04,93.12,100
4,2013-01-07,93.20,100
...,...,...,...
1213,2017-08-25,47.65,100
1214,2017-08-28,46.40,100
1215,2017-08-29,46.46,100
1216,2017-08-30,45.96,100


In [59]:
oil.sort_index(ascending = False) # our last observation becomes the first, and vice versa

Unnamed: 0,date,dcoilwtico,benchmark
1217,2017-08-31,47.26,100
1216,2017-08-30,45.96,100
1215,2017-08-29,46.46,100
1214,2017-08-28,46.40,100
1213,2017-08-25,47.65,100
...,...,...,...
4,2013-01-07,93.20,100
3,2013-01-04,93.12,100
2,2013-01-03,92.97,100
1,2013-01-02,93.14,100


In [60]:
# the columns are now in reverse alphabetical order: the alphabetically last column comes first

oil.sort_index(ascending = False, axis=1) # sort columns

Unnamed: 0,dcoilwtico,date,benchmark
0,,2013-01-01,100
1,93.14,2013-01-02,100
2,92.97,2013-01-03,100
3,93.12,2013-01-04,100
4,93.20,2013-01-07,100
...,...,...,...
1213,47.65,2017-08-25,100
1214,46.40,2017-08-28,100
1215,46.46,2017-08-29,100
1216,45.96,2017-08-30,100


In [62]:
# sort by values
# create a "month" column

oil['month'] = oil['date'].astype('datetime64[ns]').dt.month

oil.head()

Unnamed: 0,date,dcoilwtico,benchmark,month
0,2013-01-01,,100,1
1,2013-01-02,93.14,100,1
2,2013-01-03,92.97,100,1
3,2013-01-04,93.12,100,1
4,2013-01-07,93.2,100,1


In [64]:
# sort by a single column
# NaN values are placed at the end by default (na_position='last'), whether sorting ascending or descending

oil.sort_values("dcoilwtico", ascending=False) 

Unnamed: 0,date,dcoilwtico,benchmark,month
178,2013-09-06,110.62,100,9
171,2013-08-28,110.17,100,8
179,2013-09-09,109.62,100,9
170,2013-08-27,109.11,100,8
182,2013-09-12,108.72,100,9
...,...,...,...,...
1079,2017-02-20,,100,2
1118,2017-04-14,,100,4
1149,2017-05-29,,100,5
1174,2017-07-03,,100,7


In [65]:
# sort by multiple columns (a list of columns)
# sorting from our earliest to latest month, and then looking at the highest to lowest prices within each month

# in the rows shown, January's highest prices fall at the end of the month

oil.sort_values(['month', 'dcoilwtico'], ascending=[True, False])

Unnamed: 0,date,dcoilwtico,benchmark,month
282,2014-01-30,98.25,100,1
21,2013-01-30,97.98,100,1
22,2013-01-31,97.65,100,1
20,2013-01-29,97.62,100,1
283,2014-01-31,97.55,100,1
...,...,...,...,...
774,2015-12-21,34.55,100,12
256,2013-12-25,,100,12
517,2014-12-25,,100,12
778,2015-12-25,,100,12


## 6 - Modifying Columns 

### Renaming Columns

* `Rename columns` in place by assigning a list of new names to the `.columns` attribute 

<img src="../images/df-15_rename_1.png">

* You can `rename columns` with the `.rename()` method 

<img src="../images/df-16_rename_2.png">
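
Both renaming approaches, sketched on a made-up one-row DataFrame (`oil_sample` is a hypothetical name):

```python
import pandas as pd

# Hypothetical one-row sample
oil_sample = pd.DataFrame({"date": ["2013-01-02"], "dcoilwtico": [93.14]})

# .rename() returns a new DataFrame; only the columns named in the dictionary change
renamed = oil_sample.rename(columns={"dcoilwtico": "oil_price"})

# .rename() also accepts a function that is applied to every column name
upper = oil_sample.rename(columns=str.upper)

# Assigning to .columns overwrites all names in place, so the list must cover every column
oil_sample.columns = ["price_date", "oil_price"]

print(renamed.columns.tolist())
print(upper.columns.tolist())
print(oil_sample.columns.tolist())
```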

### Reordering Columns

* `Reorder columns` with the `.reindex()` method when sorting won't suffice

<img src="../images/df-17_reorder.png">
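
A minimal sketch of reordering with `.reindex()`, assuming a hypothetical `oil_sample` with three columns:

```python
import pandas as pd

# Hypothetical sample
oil_sample = pd.DataFrame({
    "date": ["2013-01-02", "2013-01-03"],
    "price": [93.14, 92.97],
    "benchmark": [100, 100],
})

# List the columns in the order you want them; axis=1 targets the columns
reordered = oil_sample.reindex(labels=["benchmark", "date", "price"], axis=1)

print(reordered)
```

Be careful with spelling here: a label that doesn't exist in the DataFrame is added as a new column filled with NaN rather than raising an error.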

### Arithmetic Column Creation

You can `create columns with arithmetic` by assigning them Series operations
* Simply specify the new column name and assign the operation of interest
* The new columns are added to the end of the DataFrame by default

    ```python

    baby_books["tax_amount"] = baby_books["sales"] * 0.05

    baby_books["total"] = baby_books["sales"] + baby_books["tax_amount"]
    ```


### Boolean Column Creation

You can `create Boolean columns` by assigning them a logical test

<img src="../images/df-18_boolean_column_creation.png">

#### Practice

In [67]:
oil['benchmark'] = 90 

oil 

Unnamed: 0,date,dcoilwtico,benchmark,month
0,2013-01-01,,90,1
1,2013-01-02,93.14,90,1
2,2013-01-03,92.97,90,1
3,2013-01-04,93.12,90,1
4,2013-01-07,93.20,90,1
...,...,...,...,...
1213,2017-08-25,47.65,90,8
1214,2017-08-28,46.40,90,8
1215,2017-08-29,46.46,90,8
1216,2017-08-30,45.96,90,8


In [69]:
oil['benchmark_ratio'] = (oil.loc[:, 'dcoilwtico'] / oil.loc[:, 'benchmark']) * 100 

oil 

Unnamed: 0,date,dcoilwtico,benchmark,month,benchmark_ratio
0,2013-01-01,,90,1,
1,2013-01-02,93.14,90,1,103.488889
2,2013-01-03,92.97,90,1,103.300000
3,2013-01-04,93.12,90,1,103.466667
4,2013-01-07,93.20,90,1,103.555556
...,...,...,...,...,...
1213,2017-08-25,47.65,90,8,52.944444
1214,2017-08-28,46.40,90,8,51.555556
1215,2017-08-29,46.46,90,8,51.622222
1216,2017-08-30,45.96,90,8,51.066667


In [71]:
# Maybe you want to buy oil at a given price 

oil.loc[:, 'benchmark_ratio'] < 80 

0       False
1       False
2       False
3       False
4       False
        ...  
1213     True
1214     True
1215     True
1216     True
1217     True
Name: benchmark_ratio, Length: 1218, dtype: bool

In [73]:
oil["buy"] = oil.loc[:, 'benchmark_ratio'] < 80 

oil

Unnamed: 0,date,dcoilwtico,benchmark,month,benchmark_ratio,buy
0,2013-01-01,,90,1,,False
1,2013-01-02,93.14,90,1,103.488889,False
2,2013-01-03,92.97,90,1,103.300000,False
3,2013-01-04,93.12,90,1,103.466667,False
4,2013-01-07,93.20,90,1,103.555556,False
...,...,...,...,...,...,...
1213,2017-08-25,47.65,90,8,52.944444,True
1214,2017-08-28,46.40,90,8,51.555556,True
1215,2017-08-29,46.46,90,8,51.622222,True
1216,2017-08-30,45.96,90,8,51.066667,True


In [75]:
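# we have a budget of $1,000,000 to buy oil
# True/False act as 1/0 in arithmetic, so "buy" holds the number of barrels the budget buys when the ratio is below 80, and 0 otherwise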

oil['buy'] = (oil.loc[:, 'benchmark_ratio'] < 80) * (1000000 / oil.loc[:, 'dcoilwtico'])

oil

Unnamed: 0,date,dcoilwtico,benchmark,month,benchmark_ratio,buy
0,2013-01-01,,90,1,,
1,2013-01-02,93.14,90,1,103.488889,0.000000
2,2013-01-03,92.97,90,1,103.300000,0.000000
3,2013-01-04,93.12,90,1,103.466667,0.000000
4,2013-01-07,93.20,90,1,103.555556,0.000000
...,...,...,...,...,...,...
1213,2017-08-25,47.65,90,8,52.944444,20986.358867
1214,2017-08-28,46.40,90,8,51.555556,21551.724138
1215,2017-08-29,46.46,90,8,51.622222,21523.891520
1216,2017-08-30,45.96,90,8,51.066667,21758.050479


### **PRO TIP:** NUMPY SELECT - Advanced Conditional Columns with Select

NumPy's `select()` function lets you create columns based on multiple conditions 
* This is more flexible than NumPy's `where()` function or Pandas' `.where()` method

<img src="../images/df-19_numpy_select.png">

#### Practice

In [76]:
oil = pd.read_csv("../retail/oil.csv")

oil.columns = ['date', 'price']

oil.head()

Unnamed: 0,date,price
0,2013-01-01,
1,2013-01-02,93.14
2,2013-01-03,92.97
3,2013-01-04,93.12
4,2013-01-07,93.2


In [78]:
# take a look at the use of where
# note: NaN > 100 evaluates to False, so rows with a missing price are labeled "Buy"

oil["buy"] = np.where(oil["price"] > 100, "Too High", "Buy")

oil 

Unnamed: 0,date,price,buy
0,2013-01-01,,Buy
1,2013-01-02,93.14,Buy
2,2013-01-03,92.97,Buy
3,2013-01-04,93.12,Buy
4,2013-01-07,93.20,Buy
...,...,...,...
1213,2017-08-25,47.65,Buy
1214,2017-08-28,46.40,Buy
1215,2017-08-29,46.46,Buy
1216,2017-08-30,45.96,Buy


In [79]:
# use value_counts to summarize the number of observations in each category

oil["buy"].value_counts()

buy
Buy         1036
Too High     182
Name: count, dtype: int64

In [80]:
# Use NumPy Select to create more categories

conditions = [
    (oil["price"] > 100),
    (oil["price"] <= 100) & (oil["price"] > 50),
    (oil["price"] <= 50)
]

choices = ["Don't Buy", "Buy", "Strong Buy"]

oil["buy"] = np.select(conditions, choices, default = "Missing")

oil.head()

Unnamed: 0,date,price,buy
0,2013-01-01,,Missing
1,2013-01-02,93.14,Buy
2,2013-01-03,92.97,Buy
3,2013-01-04,93.12,Buy
4,2013-01-07,93.2,Buy


In [82]:
oil.tail()

Unnamed: 0,date,price,buy
1213,2017-08-25,47.65,Strong Buy
1214,2017-08-28,46.4,Strong Buy
1215,2017-08-29,46.46,Strong Buy
1216,2017-08-30,45.96,Strong Buy
1217,2017-08-31,47.26,Strong Buy


In [83]:
oil["buy"].value_counts()

buy
Buy           512
Strong Buy    481
Don't Buy     182
Missing        43
Name: count, dtype: int64

### The MAP Method - Mapping Values to Columns

The `.map()` method `maps values` onto a column or an entire DataFrame
* You can pass a dictionary with existing values as the keys, and new values as the values

<img src="../images/df-20_map_method_1.png">

* You can apply lambda functions (_and others_)

<img src="../images/df-21_map_method_2.png">
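
A short sketch of both uses of `.map()` on a Series, with a dictionary and with a lambda. The Series `families` and the dictionary `category_map` are made up for illustration.

```python
import pandas as pd

# Hypothetical product families
families = pd.Series(["PRODUCE", "BOOKS", "EGGS"])

category_map = {"PRODUCE": "Grocery", "EGGS": "Grocery"}

# Values missing from the dictionary become NaN
mapped = families.map(category_map)

# One way to label the unmapped values instead of leaving them as NaN
mapped_filled = families.map(category_map).fillna("Other")

# A function (here a lambda) is applied to each value
title_case = families.map(lambda name: name.title())

print(mapped)
print(mapped_filled)
print(title_case)
```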

#### Practice

In [87]:
retail = pd.read_csv("../retail/retail_2016_2017.csv")

retail.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,1945944,2016-01-01,1,AUTOMOTIVE,0.0,0
1,1945945,2016-01-01,1,BABY CARE,0.0,0
2,1945946,2016-01-01,1,BEAUTY,0.0,0
3,1945947,2016-01-01,1,BEVERAGES,0.0,0
4,1945948,2016-01-01,1,BOOKS,0.0,0


In [89]:
retail.loc[:, 'family'].value_counts()

family
AUTOMOTIVE                    31968
HOME APPLIANCES               31968
SCHOOL AND OFFICE SUPPLIES    31968
PRODUCE                       31968
PREPARED FOODS                31968
POULTRY                       31968
PLAYERS AND ELECTRONICS       31968
PET SUPPLIES                  31968
PERSONAL CARE                 31968
MEATS                         31968
MAGAZINES                     31968
LIQUOR,WINE,BEER              31968
LINGERIE                      31968
LAWN AND GARDEN               31968
LADIESWEAR                    31968
HOME CARE                     31968
HOME AND KITCHEN II           31968
BABY CARE                     31968
HOME AND KITCHEN I            31968
HARDWARE                      31968
GROCERY II                    31968
GROCERY I                     31968
FROZEN FOODS                  31968
EGGS                          31968
DELI                          31968
DAIRY                         31968
CLEANING                      31968
CELEBRATION          

In [90]:
product_category_dict = {
    'PRODUCE': 'Grocery',
    'POULTRY': 'Grocery',
    'GROCERY I': 'Grocery',
    'GROCERY II': 'Grocery',
    'EGGS': 'Grocery',
}

In [92]:
retail.loc[:, 'family'].map(product_category_dict)

0              NaN
1              NaN
2              NaN
3              NaN
4              NaN
            ...   
1054939    Grocery
1054940        NaN
1054941    Grocery
1054942        NaN
1054943        NaN
Name: family, Length: 1054944, dtype: object

In [94]:
retail.loc[:, 'family'].map(product_category_dict).value_counts(dropna=False)

family
NaN        895104
Grocery    159840
Name: count, dtype: int64

### **PRO TIP:** Column Creation with Assign

The `.assign()` method `creates multiple columns` at once and returns a DataFrame
* This can be chained together with other data processing methods

<img src="../images/df-22_assign_1.png">

#### Practice

In [95]:
sample_df = retail.sample(5, random_state=85)

sample_df

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
383220,2329164,2016-08-03,11,MEATS,852.20404,0
883307,2829251,2017-05-11,42,PREPARED FOODS,53.282,2
656822,2602766,2017-01-04,38,MAGAZINES,4.0,0
389404,2335348,2016-08-06,35,BOOKS,0.0,0
216577,2162521,2016-05-01,35,SCHOOL AND OFFICE SUPPLIES,49.0,7


In [96]:
sample_df.assign(
    onpromotion_flag = sample_df["onpromotion"] > 0 
)

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,onpromotion_flag
383220,2329164,2016-08-03,11,MEATS,852.20404,0,False
883307,2829251,2017-05-11,42,PREPARED FOODS,53.282,2,True
656822,2602766,2017-01-04,38,MAGAZINES,4.0,0,False
389404,2335348,2016-08-06,35,BOOKS,0.0,0,False
216577,2162521,2016-05-01,35,SCHOOL AND OFFICE SUPPLIES,49.0,7,True


In [97]:
sample_df.assign(
    onpromotion_flag = sample_df["onpromotion"] > 0,
    family_abbrev = sample_df["family"].str[:3]
)

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,onpromotion_flag,family_abbrev
383220,2329164,2016-08-03,11,MEATS,852.20404,0,False,MEA
883307,2829251,2017-05-11,42,PREPARED FOODS,53.282,2,True,PRE
656822,2602766,2017-01-04,38,MAGAZINES,4.0,0,False,MAG
389404,2335348,2016-08-06,35,BOOKS,0.0,0,False,BOO
216577,2162521,2016-05-01,35,SCHOOL AND OFFICE SUPPLIES,49.0,7,True,SCH


In [98]:
sample_df.assign(
    onpromotion_flag = sample_df["onpromotion"] > 0,
    family_abbrev = sample_df["family"].str[:3],
    # note: dividing by 0 promotions yields inf (or NaN when sales is also 0)
    onpromotion_ratio = sample_df["sales"] / sample_df["onpromotion"]
)

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,onpromotion_flag,family_abbrev,onpromotion_ratio
383220,2329164,2016-08-03,11,MEATS,852.20404,0,False,MEA,inf
883307,2829251,2017-05-11,42,PREPARED FOODS,53.282,2,True,PRE,26.641
656822,2602766,2017-01-04,38,MAGAZINES,4.0,0,False,MAG,inf
389404,2335348,2016-08-06,35,BOOKS,0.0,0,False,BOO,
216577,2162521,2016-05-01,35,SCHOOL AND OFFICE SUPPLIES,49.0,7,True,SCH,7.0


In [99]:
sample_df.assign(
    onpromotion_flag = sample_df["onpromotion"] > 0,
    family_abbrev = sample_df["family"].str[:3],
    onpromotion_ratio = sample_df["sales"] / sample_df["onpromotion"],

    # reference columns created within the same assign statement
    sales_onprom_target = lambda x: x["onpromotion_ratio"] > 100 
)

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,onpromotion_flag,family_abbrev,onpromotion_ratio,sales_onprom_target
383220,2329164,2016-08-03,11,MEATS,852.20404,0,False,MEA,inf,True
883307,2829251,2017-05-11,42,PREPARED FOODS,53.282,2,True,PRE,26.641,False
656822,2602766,2017-01-04,38,MAGAZINES,4.0,0,False,MAG,inf,True
389404,2335348,2016-08-06,35,BOOKS,0.0,0,False,BOO,,False
216577,2162521,2016-05-01,35,SCHOOL AND OFFICE SUPPLIES,49.0,7,True,SCH,7.0,False


In [100]:
# Add a filter using query()

sample_df.assign(
    onpromotion_flag = sample_df["onpromotion"] > 0,
    family_abbrev = sample_df["family"].str[:3],
    onpromotion_ratio = sample_df["sales"] / sample_df["onpromotion"],

    # reference columns created within the same assign statement
    sales_onprom_target = lambda x: x["onpromotion_ratio"] > 100 
).query("sales_onprom_target == True")

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,onpromotion_flag,family_abbrev,onpromotion_ratio,sales_onprom_target
383220,2329164,2016-08-03,11,MEATS,852.20404,0,False,MEA,inf,True
656822,2602766,2017-01-04,38,MAGAZINES,4.0,0,False,MAG,inf,True


## 7 - Pandas Data Types 

<img src="../images/df-23_pandas_data_types.png">

### The Categorical Data Type

<img src="../images/df-24_categorical_data_type.png">

In [101]:
retail = pd.read_csv("../retail/retail_2016_2017.csv")

retail.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1054944 entries, 0 to 1054943
Data columns (total 6 columns):
 #   Column       Non-Null Count    Dtype  
---  ------       --------------    -----  
 0   id           1054944 non-null  int64  
 1   date         1054944 non-null  object 
 2   store_nbr    1054944 non-null  int64  
 3   family       1054944 non-null  object 
 4   sales        1054944 non-null  float64
 5   onpromotion  1054944 non-null  int64  
dtypes: float64(1), int64(3), object(2)
memory usage: 167.8 MB


In [102]:
retail.loc[:, 'family'].value_counts()

family
AUTOMOTIVE                    31968
HOME APPLIANCES               31968
SCHOOL AND OFFICE SUPPLIES    31968
PRODUCE                       31968
PREPARED FOODS                31968
POULTRY                       31968
PLAYERS AND ELECTRONICS       31968
PET SUPPLIES                  31968
PERSONAL CARE                 31968
MEATS                         31968
MAGAZINES                     31968
LIQUOR,WINE,BEER              31968
LINGERIE                      31968
LAWN AND GARDEN               31968
LADIESWEAR                    31968
HOME CARE                     31968
HOME AND KITCHEN II           31968
BABY CARE                     31968
HOME AND KITCHEN I            31968
HARDWARE                      31968
GROCERY II                    31968
GROCERY I                     31968
FROZEN FOODS                  31968
EGGS                          31968
DELI                          31968
DAIRY                         31968
CLEANING                      31968
CELEBRATION          

In [103]:
retail = retail.astype({"family": "category"})

retail.dtypes

id                int64
date             object
store_nbr         int64
family         category
sales           float64
onpromotion       int64
dtype: object

In [104]:
retail.loc[:, 'family'].value_counts()

family
AUTOMOTIVE                    31968
HOME APPLIANCES               31968
SCHOOL AND OFFICE SUPPLIES    31968
PRODUCE                       31968
PREPARED FOODS                31968
POULTRY                       31968
PLAYERS AND ELECTRONICS       31968
PET SUPPLIES                  31968
PERSONAL CARE                 31968
MEATS                         31968
MAGAZINES                     31968
LIQUOR,WINE,BEER              31968
LINGERIE                      31968
LAWN AND GARDEN               31968
LADIESWEAR                    31968
HOME CARE                     31968
HOME AND KITCHEN II           31968
BABY CARE                     31968
HOME AND KITCHEN I            31968
HARDWARE                      31968
GROCERY II                    31968
GROCERY I                     31968
FROZEN FOODS                  31968
EGGS                          31968
DELI                          31968
DAIRY                         31968
CLEANING                      31968
CELEBRATION          

In [105]:
retail.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1054944 entries, 0 to 1054943
Data columns (total 6 columns):
 #   Column       Non-Null Count    Dtype   
---  ------       --------------    -----   
 0   id           1054944 non-null  int64   
 1   date         1054944 non-null  object  
 2   store_nbr    1054944 non-null  int64   
 3   family       1054944 non-null  category
 4   sales        1054944 non-null  float64 
 5   onpromotion  1054944 non-null  int64   
dtypes: category(1), float64(1), int64(3), object(1)
memory usage: 100.6 MB


### Type Conversion

<img src="../images/df-25_type_conversion_1.png">

<img src="../images/df-26_type_conversion_2.png">
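
As a rough sketch of type conversion, the cell below cleans text columns and converts them with `.astype()` and `pd.to_datetime()`. The DataFrame `raw` and its values are hypothetical, and the slides may show different examples.

```python
import pandas as pd

# Hypothetical raw data where every column was read in as text
raw = pd.DataFrame({
    "date": ["2016-01-01", "2016-01-02"],
    "sales": ["$1,200.50", "$300.00"],
    "store_nbr": ["1", "2"],
})

clean = raw.assign(
    # text to datetime64
    date=pd.to_datetime(raw["date"]),
    # strip the currency symbol and thousands separator before converting to float
    sales=(raw["sales"]
           .str.replace("$", "", regex=False)
           .str.replace(",", "", regex=False)
           .astype("float64")),
    # text to integer
    store_nbr=raw["store_nbr"].astype("int64"),
)

print(clean.dtypes)
```

Calling `.astype("float64")` directly on "$1,200.50" would raise an error, which is why the cleaning step comes first.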

## 8 - Memory Optimization

### **PRO TIP:** Memory Optimization

<img src="../images/df-27_memory_optimization_0.png">
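
A minimal sketch of the idea, assuming a synthetic DataFrame `df` built just for this example: use smaller numeric types where the value range allows, and the category type for text columns with few unique values.

```python
import numpy as np
import pandas as pd

# Synthetic data: 10,000 rows, 50 stores, 4 product families
n = 10_000
df = pd.DataFrame({
    "store_nbr": np.tile(np.arange(1, 51), n // 50),
    "family": np.tile(["BOOKS", "DAIRY", "EGGS", "MEATS"], n // 4),
    "sales": np.linspace(0, 500, n),
})

before = df.memory_usage(deep=True).sum()

optimized = df.astype({
    "store_nbr": "int8",     # values 1..50 fit in the int8 range (-128 to 127)
    "family": "category",    # few unique values repeated many times
    "sales": "float32",      # half the size of float64, at reduced precision
})

after = optimized.memory_usage(deep=True).sum()

print(f"before: {before:,} bytes")
print(f"after:  {after:,} bytes")
```

Check the value range before downcasting integers, and keep in mind that `float32` trades precision for space, so it may not suit every column.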

# Key Takeaways

<img src="../images/df-28_key_takeaways.png">