##### <b> DataFrames </b></br> The Python equivalent of an Excel spreadsheet or SQL table </br> Each column in a DataFrame is a pandas Series </br> To improve code readability, give DataFrame variables a suffix such as `_df` or `_data` that signals the type

In [1]:
import pandas as pd
import numpy as np

##### <b> Axis Values </b></br> - axis=0: the row axis (the index) </br> `df.sum(axis=0)` adds values down the rows and returns one total per column </br> - axis=1: the column axis </br> `df.sum(axis=1)` adds values across the columns and returns one total per row
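The axis argument is easy to read backwards, so a small sketch may help. The toy DataFrame below is hypothetical (it is not part of the retail data); the point is that the axis you pass is the one that gets collapsed.

```python
import pandas as pd

# Hypothetical toy data, not taken from the retail dataset
toy_df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 collapses the rows: one total per column
col_totals = toy_df.sum(axis=0)
print(col_totals)

# axis=1 collapses the columns: one total per row
row_totals = toy_df.sum(axis=1)
print(row_totals)
```

`col_totals` is indexed by column name and `row_totals` by row label, which is a quick way to check which direction was summed.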

##### DataFrame Properties
| Properties| Description |
|----------|--------------|
| `shape`  | Number of rows and columns in a DataFrame (`the index is not counted as a column`)|
| `index`  | The row index in a DataFrame; by default it is a range of integers (`axis=0`)|
| `columns`| The column index in a DataFrame, represented by the Series names (`axis=1`)|
| `axes`   | The row and column indices in a DataFrame|
| `dtypes` | The data type for each Series in a DataFrame (`which can be different`)|



In [2]:
# load data from the retail_2016_2017 CSV
# read_csv accepts many optional arguments that control how the file is parsed
# common practice is to store the file path in a variable and pass that to read_csv
# instead of -> retail_df = pd.read_csv('Pandas Course Resources/retail/retail_2016_2017.csv')
# use
path_retail = 'Pandas Course Resources/retail/retail_2016_2017.csv'
retail_df = pd.read_csv(path_retail)

retail_df

,id,date,store_nbr,family,sales,onpromotion
0,1945944,2016-01-01,1,AUTOMOTIVE,0.000,0
1,1945945,2016-01-01,1,BABY CARE,0.000,0
2,1945946,2016-01-01,1,BEAUTY,0.000,0
3,1945947,2016-01-01,1,BEVERAGES,0.000,0
4,1945948,2016-01-01,1,BOOKS,0.000,0
...,...,...,...,...,...,...
1054939,3000883,2017-08-15,9,POULTRY,438.133,0
1054940,3000884,2017-08-15,9,PREPARED FOODS,154.553,1
1054941,3000885,2017-08-15,9,PRODUCE,2419.729,148
1054942,3000886,2017-08-15,9,SCHOOL AND OFFICE SUPPLIES,121.000,8
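`pd.read_csv` takes optional arguments that shape the result at load time. The sketch below shows three of them (`usecols`, `parse_dates`, `index_col`). To stay self-contained it reads from an in-memory string holding a few rows copied from the output above rather than from the real file; `csv_text` and `sample_df` are illustrative names.

```python
import io
import pandas as pd

# Stand-in for the CSV file: a few rows copied from the output above
csv_text = """id,date,store_nbr,family,sales,onpromotion
1945944,2016-01-01,1,AUTOMOTIVE,0.000,0
1945945,2016-01-01,1,BABY CARE,0.000,0
3000885,2017-08-15,9,PRODUCE,2419.729,148
"""

sample_df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["id", "date", "family", "sales"],  # load only these columns
    parse_dates=["date"],                       # parse date strings as datetimes
    index_col="id",                             # use id as the row index
)
print(sample_df.dtypes)
```

With the real file, the path variable would take the place of `io.StringIO(csv_text)`.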


In [3]:
# number of rows and columns
retail_df.shape

(1054944, 6)

In [4]:
# row index of DataFrame
retail_df.index

RangeIndex(start=0, stop=1054944, step=1)

In [5]:
# list of the DataFrame columns
retail_df.columns

Index(['id', 'date', 'store_nbr', 'family', 'sales', 'onpromotion'], dtype='object')

In [6]:
# returns the row index (axis=0) and the column index (axis=1) together as a list
retail_df.axes

[RangeIndex(start=0, stop=1054944, step=1),
 Index(['id', 'date', 'store_nbr', 'family', 'sales', 'onpromotion'], dtype='object')]

In [7]:
# data type of each column; some may need converting (e.g. date is read in as object)
retail_df.dtypes

id               int64
date            object
store_nbr        int64
family          object
sales          float64
onpromotion      int64
dtype: object
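As one possible follow-up to the dtypes check, the sketch below converts a string date column with `pd.to_datetime` and downcasts an integer column with `astype`. It uses a hypothetical two-row stand-in (`demo_df`) rather than `retail_df`; whether `int8` is wide enough for `store_nbr` in the full data is an assumption that would need checking.

```python
import pandas as pd

# Hypothetical two-row stand-in for retail_df
demo_df = pd.DataFrame({
    "date": ["2016-01-01", "2017-08-15"],
    "store_nbr": [1, 9],
    "sales": [0.0, 438.133],
})

# Convert date strings to datetime64 so date arithmetic and .dt accessors work
demo_df["date"] = pd.to_datetime(demo_df["date"])

# Downcast to a smaller integer type (assumes values fit in -128..127)
demo_df["store_nbr"] = demo_df["store_nbr"].astype("int8")

print(demo_df.dtypes)
```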

##### <b> Creating a DataFrame </b></br> Create a DataFrame by passing a Python dictionary {`Key`:`Value`} or a NumPy array `np.array([])` to the pandas `DataFrame()` constructor </br> - pd.DataFrame(`Dictionary` or `np.array`)

In [8]:
# create a DataFrame from a Python dictionary (analysts rarely need to do this)
# keys become the column names and values become the column values
pd.DataFrame(
    {
        'id': [1, 2],
        'store_nbr': [1, 2],
        'family': ['POULTRY', 'PRODUCE'],
    }
)

,id,store_nbr,family
0,1,1,POULTRY
1,2,2,PRODUCE
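The NumPy route mentioned above is not shown in the notebook, so here is a minimal sketch. The array values and column names are hypothetical. An array carries no column labels, so they are passed through the `columns` argument; without it pandas falls back to integer labels.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x3 array: each inner list becomes one row
values = np.array([[1, 2, 3], [4, 5, 6]])

# Column names are supplied separately because the array has none
array_df = pd.DataFrame(values, columns=["a", "b", "c"])
print(array_df)
```

A NumPy array holds a single data type, so this route suits all-numeric data better than mixed columns like the dictionary example.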


In [9]:
# creating DataFrame from oil.csv
# common practice is to create a path variable to call when reading in data
# instead of -> oil_df = pd.read_csv('Pandas Course Resources/retail/oil.csv')
# use
oil_path = 'Pandas Course Resources/retail/oil.csv'
oil_df = pd.read_csv(oil_path)
oil_df

,date,dcoilwtico
0,2013-01-01,
1,2013-01-02,93.14
2,2013-01-03,92.97
3,2013-01-04,93.12
4,2013-01-07,93.20
...,...,...
1213,2017-08-25,47.65
1214,2017-08-28,46.40
1215,2017-08-29,46.46
1216,2017-08-30,45.96


In [10]:
# relabel the columns by assigning a list with one name per column, in order
oil_df.columns = ['price_date', 'oil_price']
oil_df

,price_date,oil_price
0,2013-01-01,
1,2013-01-02,93.14
2,2013-01-03,92.97
3,2013-01-04,93.12
4,2013-01-07,93.20
...,...,...
1213,2017-08-25,47.65
1214,2017-08-28,46.40
1215,2017-08-29,46.46
1216,2017-08-30,45.96
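Assigning to `.columns` replaces every label at once and depends on the order of the list. An alternative, sketched below on a hypothetical two-row stand-in (`oil_demo_df`), is `DataFrame.rename` with an old-to-new mapping, which can target individual columns and returns a new DataFrame.

```python
import pandas as pd

# Hypothetical stand-in for oil_df as it looks right after read_csv
oil_demo_df = pd.DataFrame({
    "date": ["2013-01-02", "2013-01-03"],
    "dcoilwtico": [93.14, 92.97],
})

# rename maps old names to new names; the original DataFrame is left unchanged
renamed_df = oil_demo_df.rename(
    columns={"date": "price_date", "dcoilwtico": "oil_price"}
)
print(renamed_df.columns.tolist())
```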


In [11]:
# check oil_df axes
oil_df.axes

[RangeIndex(start=0, stop=1218, step=1),
 Index(['price_date', 'oil_price'], dtype='object')]

In [12]:
# check oil_df datatypes
oil_df.dtypes

price_date     object
oil_price     float64
dtype: object

##### <b> Exploring a DataFrame </b>
|Method|Description|
|------|------------|
|`head`| Returns the first n rows of the DataFrame (`Default is 5`) - `df.head(n)`|
|`tail`| Returns the last n rows of the DataFrame (`Default is 5`) - `df.tail(n)`|
|`sample`| Returns n randomly selected rows (`Default is 1`) - `df.sample(n)`|
|`info`| Prints key details on DataFrame size, columns, data types, and memory usage - `df.info()`|
|`describe`| Returns descriptive statistics for the columns in a DataFrame (`only numeric columns by default`; use the `include` argument to specify more columns) - `df.describe(include=...)`|

In [13]:
# head and tail are useful for QA right after import
# e.g. to verify that the column headers were read in correctly or need relabeling
retail_df.head()

,id,date,store_nbr,family,sales,onpromotion
0,1945944,2016-01-01,1,AUTOMOTIVE,0.0,0
1,1945945,2016-01-01,1,BABY CARE,0.0,0
2,1945946,2016-01-01,1,BEAUTY,0.0,0
3,1945947,2016-01-01,1,BEVERAGES,0.0,0
4,1945948,2016-01-01,1,BOOKS,0.0,0


In [14]:
# tail is useful for time series, to check the last date in the data
retail_df.tail()

,id,date,store_nbr,family,sales,onpromotion
1054939,3000883,2017-08-15,9,POULTRY,438.133,0
1054940,3000884,2017-08-15,9,PREPARED FOODS,154.553,1
1054941,3000885,2017-08-15,9,PRODUCE,2419.729,148
1054942,3000886,2017-08-15,9,SCHOOL AND OFFICE SUPPLIES,121.0,8
1054943,3000887,2017-08-15,9,SEAFOOD,16.0,0


In [15]:
# draw a random sample of rows; sampling uses NumPy's random number generator,
# and random_state sets its seed so the same rows come back on every run
retail_df.sample(5, random_state=616)

,id,date,store_nbr,family,sales,onpromotion
399033,2344977,2016-08-11,54,PRODUCE,487.239,1
579626,2525570,2016-11-21,22,HARDWARE,0.0,0
546385,2492329,2016-11-02,4,BOOKS,3.0,0
534555,2480499,2016-10-26,8,LINGERIE,7.0,0
96159,2042103,2016-02-23,7,PRODUCE,5212.624,0
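The table also lists `info` and `describe`, which the cells above do not run. A minimal sketch follows, using a hypothetical three-row stand-in (`demo_df`) whose values are copied from the sample output; on `retail_df` the calls would be the same.

```python
import pandas as pd

# Three rows copied from the sample output above (hypothetical stand-in)
demo_df = pd.DataFrame({
    "family": ["PRODUCE", "BOOKS", "PRODUCE"],
    "sales": [487.239, 3.0, 5212.624],
    "onpromotion": [1, 0, 0],
})

# info prints a summary (row count, dtypes, memory usage) and returns None
demo_df.info()

# describe covers only the numeric columns by default
numeric_stats = demo_df.describe()
print(numeric_stats)

# include="all" adds non-numeric columns (unique, top, freq rows)
all_stats = demo_df.describe(include="all")
print(all_stats)
```

With `include="all"`, statistics that do not apply to a column (such as `mean` for `family`) appear as NaN.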
