## Intro to Data Structures

### Series 

Series is a one-dimensional labeled array capable of holding any data type. 

The axis labels are collectively referred to as the index.

In [1]:
import numpy as np 
import pandas as pd 
import seaborn as sns 
import matplotlib.pyplot as plt 
import dash 
import plotly

In [2]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s

a    0.172406
b   -1.912867
c   -1.327210
d   -2.012826
e    2.130367
dtype: float64

In [3]:
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [4]:
pd.Series(np.random.randn(5))

0    0.027712
1   -0.850258
2   -2.468694
3   -0.569333
4   -1.079133
dtype: float64

In [5]:
d = {'b': 1, 'a': 0, 'c': 2}
pd.Series(d)

b    1
a    0
c    2
dtype: int64

In [6]:
d = {'a': 0.0, 'b': 1.0, 'c': 2.0}

pd.Series(d)

a    0.0
b    1.0
c    2.0
dtype: float64

In [7]:
pd.Series(d, index=['b', 'c', 'd', 'a'])

b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64
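
The cells above build a Series from an array and from a dict. As a small supplementary sketch (the names `scalar_series` and `partial` are illustrative and not part of the original cells), a scalar can also be used as the data, in which case it is repeated for every label in the supplied index:

```python
import numpy as np
import pandas as pd

# A scalar value is repeated for every label in the index.
scalar_series = pd.Series(5.0, index=["a", "b", "c"])

# With a dict plus an explicit index, values are looked up by label;
# labels that are absent from the dict ("z" here) become NaN.
partial = pd.Series({"a": 0.0, "b": 1.0}, index=["b", "z"])

print(scalar_series)
print(partial)
```

Note that the explicit index both reorders and filters the dict: `"a"` is dropped because it was not requested.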

#### Series is ndarray-like 

Series acts very similarly to an ndarray and is a valid argument to most NumPy functions.

However, operations such as slicing will also slice the index. 

In [8]:
s.iloc[0]  # positional access; s[0] on a label-based index is deprecated in recent pandas

0.17240622003222092

In [9]:
s[:3]

a    0.172406
b   -1.912867
c   -1.327210
dtype: float64

In [10]:
s[s > s.median()]

a    0.172406
e    2.130367
dtype: float64

In [11]:
s.iloc[[4, 3, 1]]  # select by position; plain s[[4, 3, 1]] is deprecated on a label-based index

e    2.130367
d   -2.012826
b   -1.912867
dtype: float64

In [13]:
np.sqrt(np.abs(s))

a    0.415218
b    1.383064
c    1.152046
d    1.418741
e    1.459578
dtype: float64
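
To make the ndarray-like behaviour concrete, here is a minimal self-contained sketch. It uses a small hand-written Series (`s_demo`, a name introduced only for this example) so the values are predictable, and it contrasts explicit positional access, label access, and conversion to a plain array:

```python
import numpy as np
import pandas as pd

s_demo = pd.Series([0.5, -1.5, 2.0], index=["a", "b", "c"])

# .iloc selects by position, .loc selects by label.
first_value = s_demo.iloc[0]
same_value = s_demo.loc["a"]

# NumPy functions applied to a Series return a Series with the same index.
roots = np.sqrt(np.abs(s_demo))

# to_numpy() drops the index and returns a plain ndarray.
raw = s_demo.to_numpy()

print(first_value, same_value)
print(roots)
print(type(raw), raw)
```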

#### Series is Dict-Like

A Series is also like a fixed-size dict in that you can get and set values by index label:

In [14]:
s['a']

0.17240622003222092

In [15]:
s['e'] = 12.0

In [16]:
s

a     0.172406
b    -1.912867
c    -1.327210
d    -2.012826
e    12.000000
dtype: float64

In [17]:
'e' in s

True

In [18]:
'f' in s

False

In [19]:
s.get('f')

In [20]:
s.get('f', np.nan)

nan
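
Cell 19 shows no output because `get` returns `None` for a missing label. The sketch below, using an illustrative Series named `s_demo`, contrasts that with bracket access, which raises `KeyError` just as a dict would:

```python
import pandas as pd

s_demo = pd.Series([1.0, 2.0], index=["a", "b"])

# Bracket access with a missing label raises KeyError, like a dict.
try:
    s_demo["f"]
    raised = False
except KeyError:
    raised = True

# get() returns None, or the supplied default, instead of raising.
missing = s_demo.get("f")
fallback = s_demo.get("f", 0.0)

print(raised, missing, fallback)
```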

#### Vectorized Operations and Label Alignment with Series 

When working with raw NumPy arrays, looping through value-by-value is usually not necessary. 

The same is true when working with Series in pandas. Series can also be passed into most NumPy methods expecting an ndarray.

In [21]:
s + s

a     0.344812
b    -3.825733
c    -2.654420
d    -4.025653
e    24.000000
dtype: float64

In [22]:
s * 2

a     0.344812
b    -3.825733
c    -2.654420
d    -4.025653
e    24.000000
dtype: float64

In [23]:
np.sqrt(np.abs(s))

a    0.415218
b    1.383064
c    1.152046
d    1.418741
e    3.464102
dtype: float64

In [24]:
np.exp(s)

a         1.188160
b         0.147656
c         0.265216
d         0.133611
e    162754.791419
dtype: float64

In [25]:
s

a     0.172406
b    -1.912867
c    -1.327210
d    -2.012826
e    12.000000
dtype: float64
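
The cells above show vectorized arithmetic but not label alignment. As a minimal sketch (with an illustrative Series `s_demo`), adding two Series whose indexes only partly overlap aligns them by label first, and any label present in just one operand yields NaN:

```python
import numpy as np
import pandas as pd

s_demo = pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"])

# The left operand has labels b, c, d; the right has a, b, c.
# Values are matched by label, not by position.
aligned = s_demo.iloc[1:] + s_demo.iloc[:-1]

print(aligned)

# dropna() discards the labels that had no partner.
print(aligned.dropna())
```

The result carries the union of both indexes, which is why `"a"` and `"d"` appear with NaN rather than being silently dropped.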

### DataFrame

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. 

You can think of it like a spreadsheet or SQL table, or a dict of Series objects.

It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of inputs.

Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. 

In [26]:
d = {
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
}

In [27]:
df = pd.DataFrame(d)

In [28]:
df

   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0


In [29]:
pd.DataFrame(d, index=['d', 'b', 'a'])

   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0


In [30]:
pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])

   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN


In [31]:
d = {'one': [1.0, 2.0, 3.0, 4.0], 'two': [4.0, 3.0, 2.0, 1.0]}

pd.DataFrame(d)

   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0


In [32]:
pd.DataFrame(d, index=['a', 'b', 'c', 'd'])

   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0
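
The examples so far use dicts of Series and dicts of lists. As a supplementary sketch of two other common inputs (all variable names here are illustrative), a DataFrame can also be built from a list of dicts, where each dict is one row, or from a 2-D NumPy array:

```python
import numpy as np
import pandas as pd

# List of dicts: each dict is a row; keys missing from a row become NaN.
records = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]
from_records = pd.DataFrame(records, index=["first", "second"])

# 2-D ndarray: rows and columns come from the array shape,
# and column labels are supplied separately.
from_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["x", "y"])

print(from_records)
print(from_array)
```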


#### Column Selection, Addition, Deletion

You can treat a DataFrame semantically like a dict of like-indexed Series objects.

Getting, setting, and deleting columns works with the same syntax as the analogous dict operations.

In [33]:
df['one']

a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

In [34]:
d

{'one': [1.0, 2.0, 3.0, 4.0], 'two': [4.0, 3.0, 2.0, 1.0]}

In [35]:
df['three'] = df['one'] * df['two']

In [36]:
df

   one  two  three
a  1.0  1.0    1.0
b  2.0  2.0    4.0
c  3.0  3.0    9.0
d  NaN  4.0    NaN


In [37]:
df['flag'] = df['one'] > 2

In [38]:
df

   one  two  three   flag
a  1.0  1.0    1.0  False
b  2.0  2.0    4.0  False
c  3.0  3.0    9.0   True
d  NaN  4.0    NaN  False


In [39]:
df['foo'] = 'bar'

In [40]:
df

   one  two  three   flag  foo
a  1.0  1.0    1.0  False  bar
b  2.0  2.0    4.0  False  bar
c  3.0  3.0    9.0   True  bar
d  NaN  4.0    NaN  False  bar
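
The cells above cover getting and setting columns; deletion is mentioned but not shown. The following sketch, on a small illustrative frame called `df_demo`, removes columns with `del` and `pop` and places a new column at a chosen position with `insert`:

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [4.0, 5.0, 6.0]},
    index=["a", "b", "c"],
)
df_demo["three"] = df_demo["one"] * df_demo["two"]

# del removes a column in place, like deleting a dict key.
del df_demo["two"]

# pop removes a column and returns it as a Series.
three = df_demo.pop("three")

# insert places a new column at a specific position (0 = first).
df_demo.insert(0, "zero", 0)

print(df_demo)
print(three)
```

Plain assignment (`df_demo["new"] = ...`) always appends at the end, so `insert` is the option to reach for when column order matters.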


In [41]:
df = pd.DataFrame(np.random.randn(10, 4), columns=["A", "B", "C", "D"])
df2 = pd.DataFrame(np.random.randn(7, 3), columns=["A", "B", "C"])

In [42]:
df + df2

          A         B         C   D
0 -2.359069  0.385202 -0.570564 NaN
1 -1.381312  2.259423 -0.479274 NaN
2 -0.092083  1.047523 -1.657126 NaN
3 -2.023153  0.252708 -0.820997 NaN
4  0.176632  0.360403 -3.672431 NaN
5  0.862704  0.822305  1.025391 NaN
6 -0.900697  0.413308 -1.711544 NaN
7       NaN       NaN       NaN NaN
8       NaN       NaN       NaN NaN
9       NaN       NaN       NaN NaN
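
The NaN entries above arise because arithmetic between two DataFrames aligns on both row and column labels, and any cell missing from either operand becomes NaN. A minimal sketch with two tiny illustrative frames (`left` and `right`) shows this, along with the `add` method's `fill_value` argument, which substitutes a value for cells missing on one side only:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame(np.ones((3, 2)), columns=["A", "B"])
right = pd.DataFrame(np.ones((2, 1)), columns=["A"])

# Operator form: row 2 and column "B" are missing from `right`, so they are NaN.
total = left + right

# Method form: cells missing from one operand are treated as 0 before adding.
filled = left.add(right, fill_value=0)

print(total)
print(filled)
```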


In [43]:
df - df.iloc[0]

          A         B         C         D
0  0.000000  0.000000  0.000000  0.000000
1 -0.339856  1.455582 -1.462384 -2.667201
2  2.001832 -0.797266 -1.292666 -0.091214
3  1.384365 -0.270054 -1.242006 -2.549234
4  2.788528 -0.357025 -2.684149 -1.172831
5  2.221887  0.512074 -1.332705 -1.394735
6 -0.282900 -0.052856 -0.484923 -2.570220
7  1.298593 -0.379795  0.843817 -2.202279
8  2.994009 -1.382653 -1.035187 -3.033649
9  1.201607 -0.579711 -3.095216 -3.133356
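
Subtracting `df.iloc[0]` works because a Series operand is aligned with the DataFrame's column labels and then applied to every row. The sketch below, on an illustrative frame `df_demo`, repeats that row-wise case and shows the column-wise counterpart using `sub` with `axis=0`, which aligns the Series with the row index instead:

```python
import pandas as pd

df_demo = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]})

# Row-wise: the first row is matched to columns A and B and subtracted from every row.
row_centered = df_demo - df_demo.iloc[0]

# Column-wise: column A is matched to the row index and subtracted from every column.
col_centered = df_demo.sub(df_demo["A"], axis=0)

print(row_centered)
print(col_centered)
```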


In [44]:
df * 5 + 2

          A         B          C          D
0 -5.431459  1.932257   4.945985  15.226682
1 -7.130737  9.210164  -2.365936   1.890678
2  4.577703 -2.054073  -1.517346  14.770613
3  1.490364  0.581985  -1.264047   2.480511
4  8.511179  0.147134  -8.474761   9.362526
5  5.677977  4.492628  -1.717540   8.253007
6 -6.845959  1.667978   2.521369   2.375584
7  1.061507  0.033282   9.165068   4.215287
8  9.538584 -4.981007  -0.229949   0.058439
9  0.576576 -0.966298 -10.530097  -0.440100


In [45]:
df

          A         B         C         D
0 -1.486292 -0.013549  0.589197  2.645336
1 -1.826147  1.442033 -0.873187 -0.021864
2  0.515541 -0.810815 -0.703469  2.554123
3 -0.101927 -0.283603 -0.652809  0.096102
4  1.302236 -0.370573 -2.094952  1.472505
5  0.735595  0.498526 -0.743508  1.250601
6 -1.769192 -0.066404  0.104274  0.075117
7 -0.187699 -0.393344  1.433014  0.443057
8  1.507717 -1.396201 -0.445990 -0.388312
9 -0.284685 -0.593260 -2.506019 -0.488020


In [46]:
df ** 4

           A             B          C             D
0   4.879960  3.369608e-08   0.120515  4.896927e+01
1  11.120989  4.324149e+00   0.581339  2.285358e-07
2   0.070640  4.322014e-01   0.244895  4.255660e+01
3   0.000108  6.469099e-03   0.181612  8.529684e-05
4   2.875800  1.885801e-02  19.261782  4.701402e+00
5   0.292790  6.176609e-02   0.305592  2.446109e+00
6   9.797148  1.944405e-05   0.000118  3.183826e-05
7   0.001241  2.393802e-02   4.216978  3.853366e-02
8   5.167484  3.800077e+00   0.039564  2.273653e-02
9   0.006568  1.238738e-01  39.440072  5.672188e-02


In [47]:
df[:5].T

          0         1         2         3         4
A -1.486292 -1.826147  0.515541 -0.101927  1.302236
B -0.013549  1.442033 -0.810815 -0.283603 -0.370573
C  0.589197 -0.873187 -0.703469 -0.652809 -2.094952
D  2.645336 -0.021864  2.554123  0.096102  1.472505


In [48]:
np.exp(df)

          A         B         C          D
0  0.226210  0.986543  1.802540  14.088183
1  0.161033  4.229285  0.417618   0.978373
2  1.674543  0.444496  0.494866  12.860011
3  0.903095  0.753065  0.520581   1.100872
4  3.677510  0.690339  0.123076   4.360145
5  2.086724  1.646292  0.475443   3.492443
6  0.170471  0.935752  1.109904   1.078010
7  0.828864  0.674797  4.191311   1.557462
8  4.516407  0.247535  0.640190   0.678201
9  0.752251  0.552523  0.081592   0.613841


In [49]:
np.asarray(df)

array([[-1.48629171, -0.01354861,  0.589197  ,  2.6453364 ],
       [-1.82614748,  1.4420329 , -0.87318714, -0.02186445],
       [ 0.51554057, -0.81081459, -0.70346922,  2.55412257],
       [-0.10192716, -0.28360308, -0.65280937,  0.09610218],
       [ 1.3022359 , -0.37057315, -2.09495219,  1.47250525],
       [ 0.73559545,  0.49852568, -0.74350803,  1.25060146],
       [-1.7691918 , -0.06640437,  0.10427382,  0.07511685],
       [-0.18769868, -0.39334356,  1.43301365,  0.44305748],
       [ 1.50771687, -1.39620149, -0.44598978, -0.3883122 ],
       [-0.28468479, -0.59325968, -2.50601937, -0.48802003]])
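
`np.asarray(df)` works here because every column is float64. As a hedged sketch of the equivalent pandas method, `to_numpy()`, the example below (with illustrative frames) also shows what happens when column types differ: the resulting array falls back to a dtype that can hold every value, typically `object`:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})

# All-float columns convert to a float64 array.
as_array = df_demo.to_numpy()

# Mixed float and string columns convert to an object array.
mixed = pd.DataFrame({"x": [1.0, 2.0], "label": ["a", "b"]}).to_numpy()

print(as_array.dtype, as_array.shape)
print(mixed.dtype, mixed.shape)
```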

In [53]:
baseball = pd.read_csv("https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/baseball.csv")

baseball

       id     player  year  stint  team  lg    g   ab   r    h  ...   rbi   sb   cs  bb    so  ibb  hbp   sh   sf  gidp
 0  88641  womacto01  2006      2   CHN  NL   19   50   6   14  ...   2.0  1.0  1.0   4   4.0  0.0  0.0  3.0  0.0   0.0
 1  88643  schilcu01  2006      1   BOS  AL   31    2   0    1  ...   0.0  0.0  0.0   0   1.0  0.0  0.0  0.0  0.0   0.0
 2  88645  myersmi01  2006      1   NYA  AL   62    0   0    0  ...   0.0  0.0  0.0   0   0.0  0.0  0.0  0.0  0.0   0.0
 3  88649  helliri01  2006      1   MIL  NL   20    3   0    0  ...   0.0  0.0  0.0   0   2.0  0.0  0.0  0.0  0.0   0.0
 4  88650  johnsra05  2006      1   NYA  AL   33    6   0    1  ...   0.0  0.0  0.0   0   4.0  0.0  0.0  0.0  0.0   0.0
..    ...        ...   ...    ...   ...  ..  ...  ...  ..  ...  ...   ...  ...  ...  ..   ...  ...  ...  ...  ...   ...
95  89525  benitar01  2007      2   FLO  NL   34    0   0    0  ...   0.0  0.0  0.0   0   0.0  0.0  0.0  0.0  0.0   0.0
96  89526  benitar01  2007      1   SFN  NL   19    0   0    0  ...   0.0  0.0  0.0   0   0.0  0.0  0.0  0.0  0.0   0.0
97  89530  ausmubr01  2007      1   HOU  NL  117  349  38   82  ...  25.0  6.0  1.0  37  74.0  3.0  6.0  4.0  1.0  11.0
98  89533   aloumo01  2007      1   NYN  NL   87  328  51  112  ...  49.0  3.0  0.0  27  30.0  5.0  2.0  0.0  3.0  13.0


In [54]:
index = pd.date_range("1/1/2000", periods=8)

s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])

df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=["A", "B", "C"])

In [56]:
long_series = pd.Series(np.random.randn(1000))

In [57]:
long_series.head()

0    1.195730
1    0.841068
2    1.486454
3   -1.011070
4    0.426840
dtype: float64

In [58]:
long_series.tail()

995   -0.582971
996   -0.321526
997   -1.542761
998   -1.144966
999   -0.223093
dtype: float64