<p style="font-size:20px"><b>Pandas Dataframes</b></p>

In [1]:
import pandas as pd

<p style="font-size:20px">We can load files from the local system, or from over the Internet.</p>

In [4]:
#iris = pd.read_csv("iris.csv")                                    # A local copy on my Jupyter server
iris = pd.read_csv("https://raptor.kent.ac.uk/~ds756/Data/iris.csv")    # You can access this too
iris

Unnamed: 0,Mono,Sepal_Length,Sepal_Width,Petal_Length,Petal_Width,Species
0,1,5.1,3.5,1.4,0.2,setosa
1,2,4.9,3.0,1.4,0.2,setosa
2,3,4.7,3.2,1.3,0.2,setosa
3,4,4.6,3.1,1.5,0.2,setosa
4,5,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...,...
145,146,6.7,3.0,5.2,2.3,virginica
146,147,6.3,2.5,5.0,1.9,virginica
147,148,6.5,3.0,5.2,2.0,virginica
148,149,6.2,3.4,5.4,2.3,virginica


<p style="font-size:20px">Once loaded, we can briefly examine the data frame to get an idea of the data.</p>

<ul style="font-size:20px">
    <li> How many rows and columns </li>
    <li> Peek at the first few rows to sanity check. </li>
    <li> Basic statistics </li>
    <li> Verify expected column data types</li>
    <li> Confirm expected number of missing entries</li>
</ul>

In [3]:
iris.info ()   # We can see missing cells, data types

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Mono          150 non-null    int64  
 1   Sepal_Length  150 non-null    float64
 2   Sepal_Width   150 non-null    float64
 3   Petal_Length  150 non-null    float64
 4   Petal_Width   150 non-null    float64
 5   Species       150 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB


In [5]:
iris.shape     # (rows, columns)

(150, 6)

In [6]:
iris.head (5)  # First 5 rows

Unnamed: 0,Mono,Sepal_Length,Sepal_Width,Petal_Length,Petal_Width,Species
0,1,5.1,3.5,1.4,0.2,setosa
1,2,4.9,3.0,1.4,0.2,setosa
2,3,4.7,3.2,1.3,0.2,setosa
3,4,4.6,3.1,1.5,0.2,setosa
4,5,5.0,3.6,1.4,0.2,setosa


In [7]:
iris.dtypes  # The data types of the features (columns)

Mono              int64
Sepal_Length    float64
Sepal_Width     float64
Petal_Length    float64
Petal_Width     float64
Species          object
dtype: object

In [None]:
iris.describe () # quick statistics

Unnamed: 0,Mono,Sepal_Length,Sepal_Width,Petal_Length,Petal_Width
count,150.0,150.0,150.0,150.0,150.0
mean,75.5,5.843333,3.057333,3.758,1.199333
std,43.445368,0.828066,0.435866,1.765298,0.762238
min,1.0,4.3,2.0,1.0,0.1
25%,38.25,5.1,2.8,1.6,0.3
50%,75.5,5.8,3.0,4.35,1.3
75%,112.75,6.4,3.3,5.1,1.8
max,150.0,7.9,4.4,6.9,2.5
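
<p style="font-size:20px">The missing-entry check from the list above can also be done directly. A minimal sketch on a small hand-made frame; the values and the name <code>toy</code> are illustrative assumptions, not the iris file:</p>

```python
import pandas as pd

# Hypothetical toy frame with one deliberately missing cell
toy = pd.DataFrame({
    "Sepal_Length": [5.1, 4.9, None],
    "Species": ["setosa", "setosa", "virginica"],
})

n_rows, n_cols = toy.shape              # (rows, columns)
missing_per_column = toy.isna().sum()   # number of missing cells in each column
print(n_rows, n_cols)
print(missing_per_column)
```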


<p style="font-size:20px">_____________________________________________________________________________________</p>
<p style="font-size:20px">There are occasions when we have data from another source that we have computed in Python.</p>  
<p style="font-size:20px">It is often desirable to convert it to a Pandas data frame.  Consider the dictionary below:</p>

In [9]:
VIXdict = { # VIX data from the CBOE
    "DAY0" : { "Date" : "2004-01-02", "Open" : 17.96, "High" : 18.68, "Low" : 17.54, "Close" : 18.22},
    "DAY1" : { "Date" : "2004-01-05", "Open" : 18.45, "High" : 18.49, "Low" : 17.44, "Close" : 17.49},
    "DAY2" : { "Date" : "2004-01-06", "Open" : 17.66, "High" : 17.67, "Low" : 16.19, "Close" : 16.73},
    "DAY3" : { "Date" : "2004-01-07", "Open" : 16.72, "High" : 16.75, "Low" : 15.5, "Close": 15.5 },
    "DAY4" : { "Date" : "2004-01-08", "Open" : 15.42, "High" : 15.68, "Low" : 15.32, "Close" : 15.61 },
    "DAY5" : { "Date" : "2004-01-09", "Open" : 16.15, "High" : 16.88, "Low" : 15.57, "Close" : 16.75}
}

In [10]:
type (VIXdict)

dict

<p style="font-size:20px">We can easily turn our dictionary into a Pandas dataframe.</p>

In [11]:
VIX = pd.DataFrame(data=VIXdict)     # By default, dictionary keys are the DataFrame columns
VIX

Unnamed: 0,DAY0,DAY1,DAY2,DAY3,DAY4,DAY5
Date,2004-01-02,2004-01-05,2004-01-06,2004-01-07,2004-01-08,2004-01-09
Open,17.96,18.45,17.66,16.72,15.42,16.15
High,18.68,18.49,17.67,16.75,15.68,16.88
Low,17.54,17.44,16.19,15.5,15.32,15.57
Close,18.22,17.49,16.73,15.5,15.61,16.75


In [12]:
VIX = VIX.T                          # The transpose gives us what we want, each date is a row
VIX

Unnamed: 0,Date,Open,High,Low,Close
DAY0,2004-01-02,17.96,18.68,17.54,18.22
DAY1,2004-01-05,18.45,18.49,17.44,17.49
DAY2,2004-01-06,17.66,17.67,16.19,16.73
DAY3,2004-01-07,16.72,16.75,15.5,15.5
DAY4,2004-01-08,15.42,15.68,15.32,15.61
DAY5,2004-01-09,16.15,16.88,15.57,16.75


In [13]:
VIX.shape

(6, 5)

In [14]:
VIX.describe ()

Unnamed: 0,Date,Open,High,Low,Close
count,6,6.0,6.0,6.0,6.0
unique,6,6.0,6.0,6.0,6.0
top,2004-01-02,17.96,18.68,17.54,18.22
freq,1,1.0,1.0,1.0,1.0
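
<p style="font-size:20px">The summary above shows count, unique, top and freq rather than mean and std because the transposed columns hold generic Python objects instead of floats. One possible alternative, sketched here on two of the rows, is <code>pd.DataFrame.from_dict</code> with <code>orient="index"</code>, which uses the outer keys as rows from the start. The names <code>sample</code>, <code>transposed</code> and <code>by_row</code> are illustrative.</p>

```python
import pandas as pd

sample = {
    "DAY0": {"Date": "2004-01-02", "Open": 17.96, "High": 18.68, "Low": 17.54, "Close": 18.22},
    "DAY1": {"Date": "2004-01-05", "Open": 18.45, "High": 18.49, "Low": 17.44, "Close": 17.49},
}

transposed = pd.DataFrame(sample).T                      # numeric columns come out as object
by_row = pd.DataFrame.from_dict(sample, orient="index")  # numeric columns stay float64

print(transposed.dtypes)
print(by_row.dtypes)
```

<p style="font-size:20px">With float columns, <code>by_row.describe()</code> reports the usual numeric statistics.</p>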


<p style="font-size:20px">_______________________________________________________________________________________</p>
<p style="font-size:20px">Columns, also known as features, are the essence of the data frame.  Pandas treats them as first-class objects.</p>

<ul style="font-size:20px">
    <li>A Pandas data frame is similar to a dictionary of columns.</li>
    <li>Performance is usually much better when processing is column oriented rather than row by row.</li>
    <li>There are many ways of accessing and selecting columns.</li>
</ul>
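
<p style="font-size:20px">As a rough illustration of column-oriented processing, the sketch below computes the same quantity twice: once as a single operation on whole columns, and once with a Python loop over rows. The frame <code>df</code> and both variable names are assumptions made for the example.</p>

```python
import pandas as pd

df = pd.DataFrame({"High": [18.68, 18.49, 17.67], "Low": [17.54, 17.44, 16.19]})

# Column oriented: one vectorised subtraction over whole columns
range_vectorised = df["High"] - df["Low"]

# Row oriented: a Python-level loop, same values but typically slower on large frames
range_looped = pd.Series(
    [row.High - row.Low for row in df.itertuples(index=False)],
    index=df.index,
)

print(range_vectorised)
```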

In [15]:
feature = iris["Sepal_Width"]        # We can get a reference to an individual feature
print (type (feature))               # The Pandas type for a feature/column

<class 'pandas.core.series.Series'>


In [16]:
feature.max ()

4.4

In [17]:
iris["Sepal_Width"].max ()

4.4

In [18]:
iris.Sepal_Width.max ()

4.4

In [19]:
iris.columns  # What features are in my data frame

Index(['Mono', 'Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width',
       'Species'],
      dtype='object')

In [20]:
feature_name = iris.columns[1]
feature_name, iris[feature_name].mean ()

('Sepal_Length', 5.843333333333334)

In [21]:
iris = iris.drop (columns="Mono")  # We do not need this feature, so delete the column

In [22]:
for column in iris.columns:                        # iterate over all features
    if iris[column].dtype == "float64":            # only work with numbers, skip Species
        print (column, iris[column].mean ())       # print the mean

Sepal_Length 5.843333333333334
Sepal_Width 3.0573333333333337
Petal_Length 3.7580000000000005
Petal_Width 1.1993333333333336
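
<p style="font-size:20px">The loop above can also be written without an explicit dtype test. A minimal sketch using <code>select_dtypes</code> on a small hand-made frame (the three rows and the name <code>small</code> are illustrative, not the full iris data):</p>

```python
import pandas as pd

small = pd.DataFrame({
    "Sepal_Length": [5.1, 4.9, 6.7],
    "Sepal_Width": [3.5, 3.0, 3.0],
    "Species": ["setosa", "setosa", "virginica"],
})

# Keep only the numeric columns, then take the mean of each one
means = small.select_dtypes(include="number").mean()
print(means)
```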


In [23]:

VIX["Range"] = VIX.High - VIX.Low # We can add columns to an existing dataframe
VIX

Unnamed: 0,Date,Open,High,Low,Close,Range
DAY0,2004-01-02,17.96,18.68,17.54,18.22,1.14
DAY1,2004-01-05,18.45,18.49,17.44,17.49,1.05
DAY2,2004-01-06,17.66,17.67,16.19,16.73,1.48
DAY3,2004-01-07,16.72,16.75,15.5,15.5,1.25
DAY4,2004-01-08,15.42,15.68,15.32,15.61,0.36
DAY5,2004-01-09,16.15,16.88,15.57,16.75,1.31


In [41]:
c = VIX.High - VIX.Low # Arithmetic on two columns yields a new Series
type(c)

pandas.core.series.Series
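
<p style="font-size:20px">Assigning to <code>VIX["Range"]</code> modifies the dataframe in place. If the original should stay untouched, <code>assign</code> is one option, since it returns a new dataframe. A small sketch with made-up variable names:</p>

```python
import pandas as pd

prices = pd.DataFrame({"High": [18.68, 18.49], "Low": [17.54, 17.44]})

# assign returns a new frame; prices itself is left unchanged
with_range = prices.assign(Range=prices["High"] - prices["Low"])
print(with_range)
```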

<p style="font-size:20px">_____________________________________________________________________________________</p>
<p style="font-size:20px"><b>Subsetting Data</b></p>

<p style="font-size:20px">Pandas offers support for creating subsets.  There are many reasons for creating subsets of the entire data frame.</p>

<ul style="font-size:20px">
    <li>Data can be enormous; make the problem tractable.</li>
    <li>A particular subset may be interesting.</li>
    <li>Discarding outliers or bad data.</li>
    <li>Testing, training and verifying.</li>
</ul>

<p style="font-size:20px">Pandas offers many ways of creating subsets of a dataframe, and a subset is usually a dataframe itself.</p>

<p style="font-size:20px">If you have used SQL before, it helps to think of SELECT here.</p>

<p style="font-size:20px">To find data we must first <b>index</b> the data.</p>

<p style="font-size:20px">Pandas will create an index from a column when we designate that column as the <b>key</b>.</p>

In [24]:
VIX.columns # Date is an ordinary feature.

Index(['Date', 'Open', 'High', 'Low', 'Close', 'Range'], dtype='object')

In [25]:
VIX = VIX.set_index ("Date")
VIX                            # Notice that the Date column is now special

Date,Open,High,Low,Close,Range
2004-01-02,17.96,18.68,17.54,18.22,1.14
2004-01-05,18.45,18.49,17.44,17.49,1.05
2004-01-06,17.66,17.67,16.19,16.73,1.48
2004-01-07,16.72,16.75,15.5,15.5,1.25
2004-01-08,15.42,15.68,15.32,15.61,0.36
2004-01-09,16.15,16.88,15.57,16.75,1.31


In [26]:
VIX.columns # Date has disappeared from the list of features.

Index(['Open', 'High', 'Low', 'Close', 'Range'], dtype='object')

In [27]:
VIX.index   # Date is now indexed so it is special.

Index(['2004-01-02', '2004-01-05', '2004-01-06', '2004-01-07', '2004-01-08',
       '2004-01-09'],
      dtype='object', name='Date')
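
<p style="font-size:20px">Designating a key is reversible. A minimal sketch, on a two-row frame invented for the example, showing that <code>reset_index</code> turns the index back into an ordinary column:</p>

```python
import pandas as pd

vix = pd.DataFrame({"Date": ["2004-01-02", "2004-01-05"], "Close": [18.22, 17.49]})

indexed = vix.set_index("Date")    # Date becomes the index
restored = indexed.reset_index()   # Date is an ordinary column again

print(list(indexed.columns), list(restored.columns))
```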

In [28]:
VIXfull = pd.read_csv("http://raptor.kent.ac.uk/~ds756/Data/VIX.csv", index_col="Date") # We can also do it at load time
VIXfull.head (5)

Date,Open,High,Low,Close
2004-01-02,17.96,18.68,17.54,18.22
2004-01-05,18.45,18.49,17.44,17.49
2004-01-06,17.66,17.67,16.19,16.73
2004-01-07,16.72,16.75,15.5,15.5
2004-01-08,15.42,15.68,15.32,15.61


In [29]:
Close = VIX["Close"]         # We have seen column subsets already
Close.head (5)

Date
2004-01-02    18.22
2004-01-05    17.49
2004-01-06    16.73
2004-01-07     15.5
2004-01-08    15.61
Name: Close, dtype: object

<p style="font-size:20px">_____________________________________________________________________________________</p>
<p style="font-size:20px">We can also create subsets more generally with Pandas by taking advantage of indices and different strategies.</p>
<p style="font-size:20px">We shall examine three principal means:</p>

<ul style="font-size:20px">
    <li>loc: Selects by label, i.e. the values held in the index (and the column names).</li>
    <li>iloc: Selects by integer position, i.e. the physical location of rows and columns, counting from 0.</li>
    <li>Slicing: Python slicing.</li>
</ul>
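
<p style="font-size:20px">The difference between labels and positions is easiest to see when the index labels are themselves integers. A minimal sketch; the frame <code>df</code> and its labels 10, 20, 30 are assumptions chosen for illustration:</p>

```python
import pandas as pd

df = pd.DataFrame({"Close": [18.22, 17.49, 16.73]}, index=[10, 20, 30])

by_label = df.loc[20, "Close"]    # the row whose index label is 20
by_position = df.iloc[1, 0]       # the second row, first column (positions start at 0)
third_row = df.iloc[2, 0]         # position 2 exists, whereas df.loc[2] would raise a KeyError

print(by_label, by_position, third_row)
```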

<p style="font-size:20px">______________________________________________________________________________________</p>
<p style="font-size:20px">loc[<i style="color:#FF0000;">Pattern to Match</i>]</p>

In [30]:
# VIX.loc[18.49] ERROR, we have not indexed that feature

VIX.loc["2004-01-06"]         # We have indexed VIX so we can also subset a row by Date


Open     17.66
High     17.67
Low      16.19
Close    16.73
Range     1.48
Name: 2004-01-06, dtype: object

In [31]:
VIX.loc["2004-01-06", "Close"]         # We can further narrow the scope by specifying a column

16.73

In [32]:
VIX.loc[["2004-01-06", "2004-01-08"]] # We can specify multiple searches simultaneously

Date,Open,High,Low,Close,Range
2004-01-06,17.66,17.67,16.19,16.73,1.48
2004-01-08,15.42,15.68,15.32,15.61,0.36


In [33]:
VIX.loc[["2004-01-06", "2004-01-08"], ["Open", "Close"]]  # We can request a list of columns too

Date,Open,Close
2004-01-06,17.66,16.73
2004-01-08,15.42,15.61


In [34]:
VIX.loc[VIX["High"] > 18] # Conditional, session highs greater than 18

Date,Open,High,Low,Close,Range
2004-01-02,17.96,18.68,17.54,18.22,1.14
2004-01-05,18.45,18.49,17.44,17.49,1.05
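
<p style="font-size:20px">Conditions can be combined. A minimal sketch, assuming a small numeric frame built by hand: each condition goes in its own parentheses and they are joined with <code>&amp;</code> (and) or <code>|</code> (or), not the Python keywords.</p>

```python
import pandas as pd

vix = pd.DataFrame(
    {"High": [18.68, 18.49, 17.67], "Close": [18.22, 17.49, 16.73]},
    index=pd.Index(["2004-01-02", "2004-01-05", "2004-01-06"], name="Date"),
)

# Sessions with a high above 18 that nevertheless closed below 18
both = vix.loc[(vix["High"] > 18) & (vix["Close"] < 18)]
print(both)
```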


<p style="font-size:20px">______________________________________________________________________________________</p>
<p style="font-size:20px">iloc[<i style="color:#FF0000;">Index of Location</i>]</p>

In [35]:
VIX.iloc [5]          # The row at position 5, i.e. the sixth row (positions start at 0)

Open     16.15
High     16.88
Low      15.57
Close    16.75
Range     1.31
Name: 2004-01-09, dtype: object

In [36]:
VIX.iloc[[1,2,4]]    # A list of indices

Date,Open,High,Low,Close,Range
2004-01-05,18.45,18.49,17.44,17.49,1.05
2004-01-06,17.66,17.67,16.19,16.73,1.48
2004-01-08,15.42,15.68,15.32,15.61,0.36
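
<p style="font-size:20px">iloc also accepts negative positions and a second argument for columns. A short sketch on a hand-made frame; the names <code>last_row</code> and <code>corner</code> are illustrative:</p>

```python
import pandas as pd

vix = pd.DataFrame(
    {"Open": [17.96, 18.45, 17.66], "High": [18.68, 18.49, 17.67],
     "Low": [17.54, 17.44, 16.19], "Close": [18.22, 17.49, 16.73]},
    index=pd.Index(["2004-01-02", "2004-01-05", "2004-01-06"], name="Date"),
)

last_row = vix.iloc[-1]           # negative positions count from the end
corner = vix.iloc[0:2, [0, 3]]    # first two rows, columns at positions 0 and 3

print(last_row.name)
print(corner)
```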


<p style="font-size:20px">______________________________________________________________________________________</p>

<p style="font-size:20px"><b>Slicing</b></p>

<p style="font-size:20px">An important feature of Python is called <i>slicing</i>.</p>

<p style="font-size:20px">It takes the form of <i>identifier</i>[start:end:increment], and we can use it with Pandas dataframes.</p>

In [37]:
VIX["2004-01-06":] # Everything from a date onwards (the date itself is included)

Date,Open,High,Low,Close,Range
2004-01-06,17.66,17.67,16.19,16.73,1.48
2004-01-07,16.72,16.75,15.5,15.5,1.25
2004-01-08,15.42,15.68,15.32,15.61,0.36
2004-01-09,16.15,16.88,15.57,16.75,1.31


In [38]:
VIX[:"2004-01-06"] # Everything up to and including a date

Date,Open,High,Low,Close,Range
2004-01-02,17.96,18.68,17.54,18.22,1.14
2004-01-05,18.45,18.49,17.44,17.49,1.05
2004-01-06,17.66,17.67,16.19,16.73,1.48


In [39]:
VIX["2004-01-06":"2004-01-10"] # Everything between the dates, inclusive; the end label need not exist

Date,Open,High,Low,Close,Range
2004-01-06,17.66,17.67,16.19,16.73,1.48
2004-01-07,16.72,16.75,15.5,15.5,1.25
2004-01-08,15.42,15.68,15.32,15.61,0.36
2004-01-09,16.15,16.88,15.57,16.75,1.31


In [40]:
VIX["2004-01-06":"2004-01-09":2] # Everything between the dates, stride of 2

Date,Open,High,Low,Close,Range
2004-01-06,17.66,17.67,16.19,16.73,1.48
2004-01-08,15.42,15.68,15.32,15.61,0.36
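
<p style="font-size:20px">The slices above are label based, and a label slice includes its end label. Positional slices follow the usual Python rule and exclude the end. A minimal sketch contrasting the two on a hand-made frame (names are illustrative):</p>

```python
import pandas as pd

vix = pd.DataFrame(
    {"Close": [18.22, 17.49, 16.73, 15.5]},
    index=pd.Index(["2004-01-02", "2004-01-05", "2004-01-06", "2004-01-07"], name="Date"),
)

by_label = vix.loc["2004-01-05":"2004-01-07"]   # end label included: three rows
by_position = vix.iloc[1:3]                     # end position excluded: two rows

print(len(by_label), len(by_position))
```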
