<a href="https://colab.research.google.com/github/SSSpock/skillspire/blob/main/skillspireDS_wk2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro to Data Structures

## Series

In [None]:
import numpy as np
import pandas as pd

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:

In [None]:
# Schematic only: `data` and `index` are placeholders, not defined names
# s = pd.Series(data, index=index)

Here, data can be many different things:

- a Python dict
- an ndarray
- a scalar value (like 5)
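
As a minimal sketch of the three kinds of input listed above (the variable names and values are illustrative, not part of the original notebook):

```python
import numpy as np
import pandas as pd

# From a dict: the keys become the index labels.
from_dict = pd.Series({"x": 10, "y": 20})

# From an ndarray: pass an index of the same length, or omit it
# to get the default integer index 0..n-1.
from_array = pd.Series(np.array([1.5, 2.5, 3.5]), index=["a", "b", "c"])

# From a scalar: the value is repeated for every label in the index.
from_scalar = pd.Series(5, index=["a", "b", "c"])

print(from_dict)
print(from_array)
print(from_scalar)
```

For a scalar, the index is what determines the length of the resulting Series.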

In [None]:
s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])

s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [None]:
s_list = list(np.random.randn(5))

In [None]:
s

a   -0.232471
b   -0.984236
c    0.625574
d    0.065722
e   -0.510399
dtype: float64

In [None]:
# Series can be instantiated from dicts
d = {"b": 1, "a": 0, "c": 2}
pd.Series(d)

b    1
a    0
c    2
dtype: int64

Series acts very similarly to an ndarray and is a valid argument to most NumPy functions. However, operations such as slicing will also slice the index.

In [None]:
s

a   -0.232471
b   -0.984236
c    0.625574
d    0.065722
e   -0.510399
dtype: float64

In [None]:
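s.iloc[0]  # first element by position (s[0] is deprecated on a labeled index)

s[:3]  # slicing keeps the matching index labels

s[s > s.median()]  # boolean mask: values above the median

s.iloc[[4, 3, 1]]  # several positions, in the order given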

np.exp(s)

a    0.792573
b    0.373725
c    1.869319
d    1.067929
e    0.600256
dtype: float64
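
To make the point about slicing and the index concrete, here is a small sketch with fixed values instead of random ones. The Series `t` and the other names are assumptions introduced for illustration:

```python
import numpy as np
import pandas as pd

t = pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"])

head = t[:2]                 # positional slice; labels "a" and "b" come along
above = t[t > 2.0]           # boolean mask keeps labels "c" and "d"
doubled = np.multiply(t, 2)  # a NumPy ufunc returns a Series with the same index

print(list(head.index))
print(list(above.index))
print(doubled)
```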

In [None]:
num_list = [2,3,4,5,6,7,8]
np.exp(num_list)

In [None]:
# A Series, like an ndarray, has a single dtype
s.dtype

dtype('float64')

In [None]:
# A series is like a dict

s["a"]
s["e"] = 12
"e" in s
"f" in s

False

In [None]:
s

a    -0.232471
b    -0.984236
c     0.625574
d     0.065722
e    12.000000
dtype: float64

In [None]:
# Vectorized Operations and label alignment

s + s

s * 2

np.exp(s)

a         0.792573
b         0.373725
c         1.869319
d         1.067929
e    162754.791419
dtype: float64

In [None]:
# A series has advantages over an array.  Operations between series automatically align based on the label.

s[1:] + s[:-1]

a         NaN
b   -1.968471
c    1.251148
d    0.131443
e         NaN
dtype: float64

In [None]:
s[:-1] 

a   -0.232471
b   -0.984236
c    0.625574
d    0.065722
dtype: float64

The result of an operation between unaligned Series will have the union of the indexes involved. If a label is not found in one Series or the other, the result for that label is marked as missing (NaN). Writing code without explicit data alignment gives considerable freedom and flexibility in interactive data analysis and research. This integrated alignment sets pandas apart from most related tools for working with labeled data.
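
The alignment rule can be sketched with two small Series whose labels only partly overlap. The names `left` and `right` are illustrative assumptions. `Series.add` with `fill_value` is shown as one way to avoid the NaN results when a missing label should count as zero:

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=["a", "b", "c"])
right = pd.Series([10, 20, 30], index=["b", "c", "d"])

# "+" aligns on labels; a label present on only one side gives NaN.
total = left + right

# Series.add with fill_value substitutes 0 for a label missing on one side.
filled = left.add(right, fill_value=0)

print(total)
print(filled)
```

The result index is the union of both indexes (`a` through `d`). Only `b` and `c` have values in `total`.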

## DataFrames


# Object Creation

In [None]:
# This week we are focused on the Pandas Library
# Creating a Series by passing a list of values, letting pandas create a default integer index:
s = pd.Series([1, 3, 5, np.nan, 6, 8],name='A')
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
Name: A, dtype: float64

In [None]:
# Creating a DataFrame by passing a NumPy array, with a datetime index using date_range() and labeled columns:

# "M" is the month-end frequency alias; pandas 2.2+ prefers the equivalent "ME"
dates = pd.date_range("20130101", periods=12, freq='M')

dates


DatetimeIndex(['2013-01-31', '2013-02-28', '2013-03-31', '2013-04-30',
               '2013-05-31', '2013-06-30', '2013-07-31', '2013-08-31',
               '2013-09-30', '2013-10-31', '2013-11-30', '2013-12-31'],
              dtype='datetime64[ns]', freq='M')

In [None]:
df = pd.DataFrame(np.random.randn(12, 4), index=dates, columns=['A','B','C','D'])
df

| | A | B | C | D |
|---|---|---|---|---|
| 2013-01-31 | 0.16716 | 0.250198 | 0.429759 | -0.545001 |
| 2013-02-28 | 0.139132 | 0.028686 | 2.082247 | -0.330394 |
| 2013-03-31 | -1.148611 | 0.607278 | 0.362423 | 0.251168 |
| 2013-04-30 | -1.050622 | -1.280798 | -1.243981 | 0.425479 |
| 2013-05-31 | -1.486733 | -1.073512 | 0.814429 | -0.456931 |
| 2013-06-30 | -0.569038 | -1.733963 | -0.480121 | -1.309932 |
| 2013-07-31 | 0.448859 | -0.404789 | 0.046547 | -1.610593 |
| 2013-08-31 | -0.02078 | -0.551156 | -0.694305 | -1.236035 |
| 2013-09-30 | 0.253043 | -1.425884 | -0.294122 | -0.447534 |
| 2013-10-31 | -2.477486 | 0.114603 | 1.851779 | 0.789953 |


In [None]:
df['A']

2013-01-31    0.167160
2013-02-28    0.139132
2013-03-31   -1.148611
2013-04-30   -1.050622
2013-05-31   -1.486733
2013-06-30   -0.569038
2013-07-31    0.448859
2013-08-31   -0.020780
2013-09-30    0.253043
2013-10-31   -2.477486
2013-11-30    1.696349
2013-12-31    1.771317
Freq: M, Name: A, dtype: float64

In [None]:
# Creating a DataFrame by passing a dictionary of objects that can be converted into a series-like structure:

df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)

df2

| | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
| 1 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
| 2 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
| 3 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |


In [None]:
df2['E'] =='test'

0     True
1    False
2     True
3    False
Name: E, dtype: bool

In [None]:
# From a dict of arrays/lists: each key becomes a column
d = {"one": [1.0, 2.0, 3.0, 4.0], "two": [4.0, 3.0, 2.0, 1.0]}

pd.DataFrame(d)

| | one | two |
|---|---|---|
| 0 | 1.0 | 4.0 |
| 1 | 2.0 | 3.0 |
| 2 | 3.0 | 2.0 |
| 3 | 4.0 | 1.0 |


In [None]:
# Structured array; "S10" is a 10-byte string field (the old "a10" alias was removed in NumPy 2.0)
data = np.zeros((2,), dtype=[("A", "i4"), ("B", "f4"), ("C", "S10")])

data[:] = [(1, 2.0, "Hello"), (2, 3.0, "World")]

pd.DataFrame(data)

pd.DataFrame(data, index=["first", "second"])

pd.DataFrame(data, columns=["C", "A", "B"])

| | C | A | B |
|---|---|---|---|
| 0 | b'Hello' | 1 | 2.0 |
| 1 | b'World' | 2 | 3.0 |


In [None]:
np.zeros((2,), dtype=[("A", "i4"), ("B", "f4"), ("C", "S10")])

array([(0, 0., b''), (0, 0., b'')],
      dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])

In [None]:
# From a dict of tuples

pd.DataFrame(
    {
        ("a", "b"): {("A", "B"): 1, ("A", "C"): 2},
        ("a", "a"): {("A", "C"): 3, ("A", "B"): 4},
        ("a", "c"): {("A", "B"): 5, ("A", "C"): 6},
        ("b", "a"): {("A", "C"): 7, ("A", "B"): 8},
        ("b", "b"): {("A", "D"): 9, ("A", "B"): 10},
    }
)

| | | (a, b) | (a, a) | (a, c) | (b, a) | (b, b) |
|---|---|---|---|---|---|---|
| A | B | 1.0 | 4.0 | 5.0 | 8.0 | 10.0 |
| A | C | 2.0 | 3.0 | 6.0 | 7.0 | NaN |
| A | D | NaN | NaN | NaN | NaN | 9.0 |


# Viewing Data

In [None]:
# Use DataFrame.head() and DataFrame.tail() to view the top and bottom rows of the frame respectively:

df.head()

df.tail()

df.index

df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

DataFrame.to_numpy() gives a NumPy representation of the underlying data. Note that this can be an expensive operation when your DataFrame has columns with different data types, which comes down to a fundamental difference between pandas and NumPy: NumPy arrays have one dtype for the entire array, while pandas DataFrames have one dtype per column. When you call DataFrame.to_numpy(), pandas will find the NumPy dtype that can hold all of the dtypes in the DataFrame. This may end up being object, which requires casting every value to a Python object.

In [None]:
df.to_numpy()

array([[ 0.16716028,  0.2501983 ,  0.42975897, -0.54500057],
       [ 0.13913235,  0.02868609,  2.08224726, -0.33039399],
       [-1.14861133,  0.60727825,  0.36242285,  0.251168  ],
       [-1.0506217 , -1.28079797, -1.24398093,  0.42547869],
       [-1.48673292, -1.0735125 ,  0.8144292 , -0.45693126],
       [-0.56903801, -1.73396292, -0.4801212 , -1.30993153],
       [ 0.44885932, -0.40478857,  0.04654686, -1.61059307],
       [-0.02077987, -0.5511564 , -0.69430532, -1.23603498],
       [ 0.25304336, -1.42588371, -0.29412196, -0.44753369],
       [-2.47748635,  0.11460333,  1.85177933,  0.78995345],
       [ 1.69634873,  1.76797712, -0.01502372,  0.80491372],
       [ 1.77131655, -0.51106564, -0.92663829, -0.4688123 ]])
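
The dtype behaviour described above can be checked on two tiny frames. This is a sketch, and the frames `numeric` and `mixed` are assumptions introduced for illustration:

```python
import numpy as np
import pandas as pd

numeric = pd.DataFrame({"x": [1, 2], "y": [0.5, 1.5]})
mixed = pd.DataFrame({"x": [1, 2], "label": ["a", "b"]})

# An int64 column and a float64 column share a common numeric dtype: float64.
numeric_dtype = numeric.to_numpy().dtype

# A numeric column plus a string column has no common numeric dtype,
# so every value is stored as a Python object.
mixed_dtype = mixed.to_numpy().dtype

print(numeric_dtype)
print(mixed_dtype)
```

The frame `df` used in this notebook holds only float64 columns, so `df.to_numpy()` needs no such conversion.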

In [None]:
# Get fast summary statistics
df.describe()

| | A | B | C | D |
|---|---|---|---|---|
| count | 12.0 | 12.0 | 12.0 | 12.0 |
| mean | -0.189784 | -0.351035 | 0.161083 | -0.344476 |
| std | 1.242849 | 0.984139 | 1.028699 | 0.795717 |
| min | -2.477486 | -1.733963 | -1.243981 | -1.610593 |
| 25% | -1.075119 | -1.125334 | -0.533667 | -0.717759 |
| 50% | 0.059176 | -0.457927 | 0.015762 | -0.452232 |
| 75% | 0.301997 | 0.148502 | 0.525927 | 0.294746 |
| max | 1.771317 | 1.767977 | 2.082247 | 0.804914 |


In [None]:
# Transpose your data
df.T

| | 2013-01-31 | 2013-02-28 | 2013-03-31 | 2013-04-30 | 2013-05-31 | 2013-06-30 | 2013-07-31 | 2013-08-31 | 2013-09-30 | 2013-10-31 | 2013-11-30 | 2013-12-31 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 0.16716 | 0.139132 | -1.148611 | -1.050622 | -1.486733 | -0.569038 | 0.448859 | -0.02078 | 0.253043 | -2.477486 | 1.696349 | 1.771317 |
| B | 0.250198 | 0.028686 | 0.607278 | -1.280798 | -1.073512 | -1.733963 | -0.404789 | -0.551156 | -1.425884 | 0.114603 | 1.767977 | -0.511066 |
| C | 0.429759 | 2.082247 | 0.362423 | -1.243981 | 0.814429 | -0.480121 | 0.046547 | -0.694305 | -0.294122 | 1.851779 | -0.015024 | -0.926638 |
| D | -0.545001 | -0.330394 | 0.251168 | 0.425479 | -0.456931 | -1.309932 | -1.610593 | -1.236035 | -0.447534 | 0.789953 | 0.804914 | -0.468812 |


In [None]:
# Sort By an axis
df.sort_index(axis=0, ascending=False)

| | A | B | C | D |
|---|---|---|---|---|
| 2013-12-31 | 1.771317 | -0.511066 | -0.926638 | -0.468812 |
| 2013-11-30 | 1.696349 | 1.767977 | -0.015024 | 0.804914 |
| 2013-10-31 | -2.477486 | 0.114603 | 1.851779 | 0.789953 |
| 2013-09-30 | 0.253043 | -1.425884 | -0.294122 | -0.447534 |
| 2013-08-31 | -0.02078 | -0.551156 | -0.694305 | -1.236035 |
| 2013-07-31 | 0.448859 | -0.404789 | 0.046547 | -1.610593 |
| 2013-06-30 | -0.569038 | -1.733963 | -0.480121 | -1.309932 |
| 2013-05-31 | -1.486733 | -1.073512 | 0.814429 | -0.456931 |
| 2013-04-30 | -1.050622 | -1.280798 | -1.243981 | 0.425479 |
| 2013-03-31 | -1.148611 | 0.607278 | 0.362423 | 0.251168 |


In [None]:
# Sort by values
df.sort_values(by='B')

| | A | B | C | D |
|---|---|---|---|---|
| 2013-06-30 | -0.569038 | -1.733963 | -0.480121 | -1.309932 |
| 2013-09-30 | 0.253043 | -1.425884 | -0.294122 | -0.447534 |
| 2013-04-30 | -1.050622 | -1.280798 | -1.243981 | 0.425479 |
| 2013-05-31 | -1.486733 | -1.073512 | 0.814429 | -0.456931 |
| 2013-08-31 | -0.02078 | -0.551156 | -0.694305 | -1.236035 |
| 2013-12-31 | 1.771317 | -0.511066 | -0.926638 | -0.468812 |
| 2013-07-31 | 0.448859 | -0.404789 | 0.046547 | -1.610593 |
| 2013-02-28 | 0.139132 | 0.028686 | 2.082247 | -0.330394 |
| 2013-10-31 | -2.477486 | 0.114603 | 1.851779 | 0.789953 |
| 2013-01-31 | 0.16716 | 0.250198 | 0.429759 | -0.545001 |


# Selection

While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code we recommend the optimized pandas data access indexers: DataFrame.at, DataFrame.iat, DataFrame.loc and DataFrame.iloc. These are used with square brackets (for example df.loc[...]), not called like methods.
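
A minimal sketch of the four indexers on a small frame follows. The frame and its labels are assumptions made for illustration, not data from this notebook:

```python
import pandas as pd

frame = pd.DataFrame(
    {"A": [1, 2, 3], "B": [4, 5, 6]},
    index=["r1", "r2", "r3"],
)

by_label = frame.loc["r2", "B"]     # label-based selection
by_position = frame.iloc[1, 1]      # integer-position selection
scalar_label = frame.at["r2", "B"]  # fast scalar access by label
scalar_pos = frame.iat[1, 1]        # fast scalar access by position

# .loc slices include both endpoints; .iloc slices exclude the stop position.
rows_loc = frame.loc["r1":"r2"]
rows_iloc = frame.iloc[0:2]

print(by_label, by_position, scalar_label, scalar_pos)
print(rows_loc.equals(rows_iloc))
```

All four scalar lookups address the same cell. `.at` and `.iat` accept only a single row and column, which is what lets them skip the more general logic of `.loc` and `.iloc`.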

In [None]:
df['A']

2013-01-31    0.167160
2013-02-28    0.139132
2013-03-31   -1.148611
2013-04-30   -1.050622
2013-05-31   -1.486733
2013-06-30   -0.569038
2013-07-31    0.448859
2013-08-31   -0.020780
2013-09-30    0.253043
2013-10-31   -2.477486
2013-11-30    1.696349
2013-12-31    1.771317
Freq: M, Name: A, dtype: float64

In [None]:
df[0:3]

df['20130131':'20131031']

| | A | B | C | D |
|---|---|---|---|---|
| 2013-01-31 | 0.16716 | 0.250198 | 0.429759 | -0.545001 |
| 2013-02-28 | 0.139132 | 0.028686 | 2.082247 | -0.330394 |
| 2013-03-31 | -1.148611 | 0.607278 | 0.362423 | 0.251168 |
| 2013-04-30 | -1.050622 | -1.280798 | -1.243981 | 0.425479 |
| 2013-05-31 | -1.486733 | -1.073512 | 0.814429 | -0.456931 |
| 2013-06-30 | -0.569038 | -1.733963 | -0.480121 | -1.309932 |
| 2013-07-31 | 0.448859 | -0.404789 | 0.046547 | -1.610593 |
| 2013-08-31 | -0.02078 | -0.551156 | -0.694305 | -1.236035 |
| 2013-09-30 | 0.253043 | -1.425884 | -0.294122 | -0.447534 |
| 2013-10-31 | -2.477486 | 0.114603 | 1.851779 | 0.789953 |


In [None]:
# Selection by Label
df.loc[dates[0]]

df.loc[:, 'A']

2013-01-31    0.167160
2013-02-28    0.139132
2013-03-31   -1.148611
2013-04-30   -1.050622
2013-05-31   -1.486733
2013-06-30   -0.569038
2013-07-31    0.448859
2013-08-31   -0.020780
2013-09-30    0.253043
2013-10-31   -2.477486
2013-11-30    1.696349
2013-12-31    1.771317
Freq: M, Name: A, dtype: float64

In [None]:
dates[0]

Timestamp('2013-01-31 00:00:00', freq='M')

In [None]:
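# Label slices include both endpoints; the labels must fall within df's month-end index
df.loc["20130131":"20130430", ["A", "B"]]

In [None]:
# A single label must match an existing index entry exactly
df.loc["20130131", ["A", "B"]]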

In [None]:
# Selecting by position
df.iloc[3:5, 0:2]

df.iloc[[1, 2, 4], [0, 2]]

df.iloc[1:3, :]

df.iloc[:, 1:3]

| | B | C |
|---|---|---|
| 2013-01-31 | 0.250198 | 0.429759 |
| 2013-02-28 | 0.028686 | 2.082247 |
| 2013-03-31 | 0.607278 | 0.362423 |
| 2013-04-30 | -1.280798 | -1.243981 |
| 2013-05-31 | -1.073512 | 0.814429 |
| 2013-06-30 | -1.733963 | -0.480121 |
| 2013-07-31 | -0.404789 | 0.046547 |
| 2013-08-31 | -0.551156 | -0.694305 |
| 2013-09-30 | -1.425884 | -0.294122 |
| 2013-10-31 | 0.114603 | 1.851779 |


In [None]:
# Boolean Indexing
df[df["A"] > 0]

df[df > 0]

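| | A | B | C | D |
|---|---|---|---|---|
| 2013-01-31 | 0.16716 | 0.250198 | 0.429759 | NaN |
| 2013-02-28 | 0.139132 | 0.028686 | 2.082247 | NaN |
| 2013-03-31 | NaN | 0.607278 | 0.362423 | 0.251168 |
| 2013-04-30 | NaN | NaN | NaN | 0.425479 |
| 2013-05-31 | NaN | NaN | 0.814429 | NaN |
| 2013-06-30 | NaN | NaN | NaN | NaN |
| 2013-07-31 | 0.448859 | NaN | 0.046547 | NaN |
| 2013-08-31 | NaN | NaN | NaN | NaN |
| 2013-09-30 | 0.253043 | NaN | NaN | NaN |
| 2013-10-31 | NaN | 0.114603 | 1.851779 | 0.789953 |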


In [None]:
# Boolean filtering

df2 = df.copy()

df2["E"] = ["one", "one", "two", "three", "four", "three"] * 2  # 12 labels, one per row

df2

df2[df2["E"].isin(["two", "four"])]

# Setting

In [None]:
# Setting a new column automatically aligns the data by the indexes.
# s1 has daily dates that do not appear in df's month-end index, so F is all NaN.

s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))

df["F"] = s1

df


| | A | B | C | D | F |
|---|---|---|---|---|---|
| 2013-01-31 | 0.16716 | 0.250198 | 0.429759 | -0.545001 | NaN |
| 2013-02-28 | 0.139132 | 0.028686 | 2.082247 | -0.330394 | NaN |
| 2013-03-31 | -1.148611 | 0.607278 | 0.362423 | 0.251168 | NaN |
| 2013-04-30 | -1.050622 | -1.280798 | -1.243981 | 0.425479 | NaN |
| 2013-05-31 | -1.486733 | -1.073512 | 0.814429 | -0.456931 | NaN |
| 2013-06-30 | -0.569038 | -1.733963 | -0.480121 | -1.309932 | NaN |
| 2013-07-31 | 0.448859 | -0.404789 | 0.046547 | -1.610593 | NaN |
| 2013-08-31 | -0.02078 | -0.551156 | -0.694305 | -1.236035 | NaN |
| 2013-09-30 | 0.253043 | -1.425884 | -0.294122 | -0.447534 | NaN |
| 2013-10-31 | -2.477486 | 0.114603 | 1.851779 | 0.789953 | NaN |
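
For contrast, here is a sketch in which the new column's index partly overlaps the frame's index, so some values survive the alignment. The frame, the Series `extra`, and the dates are assumptions chosen for illustration:

```python
import pandas as pd

idx = pd.date_range("2013-01-01", periods=4, freq="D")
frame = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0]}, index=idx)

# This Series covers two of the frame's four dates, plus one extra date.
extra = pd.Series(
    [10, 20, 30],
    index=pd.to_datetime(["2013-01-02", "2013-01-03", "2013-01-09"]),
)

# Assignment aligns on the index: matching dates receive values, the other
# rows become NaN, and the 2013-01-09 entry is dropped because the frame
# has no such row.
frame["F"] = extra

print(frame)
```

The frame keeps its original four rows. Assigning a column never adds rows to the frame.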


In [None]:
# Setting a value by label (here, the mean of column B, matching the output below)
df.at[dates[0], "A"] = df["B"].mean()

In [None]:
df

| | A | B | C | D | F |
|---|---|---|---|---|---|
| 2013-01-31 | -0.351035 | 0.250198 | 0.429759 | -0.545001 | NaN |
| 2013-02-28 | 0.139132 | 0.028686 | 2.082247 | -0.330394 | NaN |
| 2013-03-31 | -1.148611 | 0.607278 | 0.362423 | 0.251168 | NaN |
| 2013-04-30 | -1.050622 | -1.280798 | -1.243981 | 0.425479 | NaN |
| 2013-05-31 | -1.486733 | -1.073512 | 0.814429 | -0.456931 | NaN |
| 2013-06-30 | -0.569038 | -1.733963 | -0.480121 | -1.309932 | NaN |
| 2013-07-31 | 0.448859 | -0.404789 | 0.046547 | -1.610593 | NaN |
| 2013-08-31 | -0.02078 | -0.551156 | -0.694305 | -1.236035 | NaN |
| 2013-09-30 | 0.253043 | -1.425884 | -0.294122 | -0.447534 | NaN |
| 2013-10-31 | -2.477486 | 0.114603 | 1.851779 | 0.789953 | NaN |


In [None]:
# Setting Values by position
df.iat[0, 1] = 0