# Intro to data structures

In [1]:
import numpy as np
import pandas as pd

## Series

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:

s = pd.Series(data, index=index)

data can be many different things: 
- Python dict
- ndarray
- simple scalar value 

The passed index is a list of axis labels. 
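One consequence of passing an index alongside array data is that the lengths must match. The sketch below uses illustrative names (`values`, `ok`, `length_error`) that are not part of the original notebook, and assumes a reasonably recent pandas version.

```python
import numpy as np
import pandas as pd

values = np.array([10.0, 20.0, 30.0])

# Matching lengths: three values, three labels.
ok = pd.Series(values, index=["x", "y", "z"])

# Mismatched lengths: three values but only two labels raises ValueError.
try:
    pd.Series(values, index=["x", "y"])
    length_error = None
except ValueError as exc:
    length_error = exc

print(ok)
print(type(length_error).__name__)
```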

### Series from ndarray

In [2]:
s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])

In [3]:
s

a   -1.219432
b    0.233082
c    1.093023
d    0.747582
e    0.931190
dtype: float64

In [4]:
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [12]:
s.array

<PandasArray>
[-1.2194316817110764, 0.23308167174596667,   1.093022698233768,
  0.7475816949228121,  0.9311898067608387]
Length: 5, dtype: float64

In [14]:
s.to_numpy()

array([-1.21943168,  0.23308167,  1.0930227 ,  0.74758169,  0.93118981])

Without an explicit index, pandas creates a default integer index running from 0 to len(data) - 1:

In [15]:
pd.Series(np.random.randn(5))

0    2.311206
1   -2.047585
2   -1.166093
3    0.233589
4    0.181435
dtype: float64

### Series from dict

In [16]:
d = {"b": 1, "a": 0, "c": 2}

In [17]:
pd.Series(d)

b    1
a    0
c    2
dtype: int64
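When an index is passed together with a dict, the values are pulled out by label rather than by position. A minimal sketch, reusing the same dict and an illustrative name `reordered`:

```python
import pandas as pd

d = {"b": 1, "a": 0, "c": 2}

# Values are looked up by label. "d" has no entry in the dict, so it
# becomes NaN, which also turns the integer values into floats.
reordered = pd.Series(d, index=["b", "c", "d", "a"])
print(reordered)
```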

A Series can be converted back to a dictionary with to_dict():

In [18]:
sd = pd.Series(d)

In [19]:
sd

b    1
a    0
c    2
dtype: int64

In [20]:
sd.to_dict()

{'b': 1, 'a': 0, 'c': 2}

In [21]:
from collections import OrderedDict, defaultdict

In [23]:
sd.to_dict(OrderedDict)

OrderedDict([('b', 1), ('a', 0), ('c', 2)])
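The `defaultdict` imported above can also serve as the target mapping. As far as I know, it has to be passed as an initialized instance rather than as the bare class; the sketch below assumes that behavior and uses an illustrative name `dd`.

```python
from collections import defaultdict

import pandas as pd

sd = pd.Series({"b": 1, "a": 0, "c": 2})

# Pass an initialized defaultdict (with its default factory set),
# not the defaultdict class itself.
dd = sd.to_dict(into=defaultdict(list))
print(dd)
```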

### From scalar value

In [24]:
pd.Series(5.0, index=["a", "b", "c", "d", "e"])

a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

### Series is ndarray-like

In [25]:
s.iloc[0]   # positional access; plain s[0] on a label index is deprecated in recent pandas

-1.2194316817110764

In [26]:
s[:3]

a   -1.219432
b    0.233082
c    1.093023
dtype: float64

In [28]:
s[s > s.median()]

c    1.093023
e    0.931190
dtype: float64

In [29]:
s.iloc[[4, 3, 1]]   # positional selection; s[[4, 3, 1]] on a label index is deprecated in recent pandas

e    0.931190
d    0.747582
b    0.233082
dtype: float64

In [30]:
np.exp(s)

a    0.295398
b    1.262485
c    2.983278
d    2.111887
e    2.537527
dtype: float64

In [31]:
s.dtype

dtype('float64')

### Series is dict-like

You can get and set values by index label.

In [32]:
s

a   -1.219432
b    0.233082
c    1.093023
d    0.747582
e    0.931190
dtype: float64

In [33]:
s["a"]

-1.2194316817110764

In [34]:
"d" in s

True

In [35]:
s.get("d")

0.7475816949228121

The difference between s[] and s.get() shows up with missing keys: s["f"] raises a KeyError, while s.get("f") returns None or the default value you pass.

In [36]:
s.get("f", np.nan)

nan
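The missing-key behavior can be checked directly. This is a small sketch with illustrative names (`raised`, `missing`, `fallback`); the Series values are random, so only the lookups matter here.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])

# Bracket indexing raises KeyError for a label that is not in the index.
try:
    s["f"]
    raised = False
except KeyError:
    raised = True

# get() returns None for a missing label, or the supplied default.
missing = s.get("f")
fallback = s.get("f", np.nan)

print(raised, missing, fallback)
```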

## Vectorized operations and label alignment with Series

When working with raw NumPy arrays, looping through value-by-value is usually not necessary. 

The same is true when working with Series in pandas. Series can also be passed into most NumPy methods expecting an ndarray.

In [38]:
s

a   -1.219432
b    0.233082
c    1.093023
d    0.747582
e    0.931190
dtype: float64

In [39]:
s+s

a   -2.438863
b    0.466163
c    2.186045
d    1.495163
e    1.862380
dtype: float64

In [40]:
np.exp(s)

a    0.295398
b    1.262485
c    2.983278
d    2.111887
e    2.537527
dtype: float64

A key difference between Series and ndarray is that operations between Series automatically align the data based on label. Thus, you can write computations without giving consideration to whether the Series involved have the same labels.

In [41]:
s[1:]

b    0.233082
c    1.093023
d    0.747582
e    0.931190
dtype: float64

In [42]:
s[:-1]

a   -1.219432
b    0.233082
c    1.093023
d    0.747582
dtype: float64

In [43]:
s[1:] + s[:-1]

a         NaN
b    0.466163
c    2.186045
d    1.495163
e         NaN
dtype: float64

Labels found in only one of the two Series are marked as missing (NaN) in the result.
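If the NaN entries are unwanted, two common options are dropping them or supplying a fill value during the addition. The sketch below uses fixed values instead of random ones so the results are easy to follow; the names `total`, `overlap`, and `filled` are illustrative.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=["a", "b", "c", "d", "e"])

# Plain addition: labels present on only one side give NaN.
total = s.iloc[1:] + s.iloc[:-1]

# dropna() discards the labels that did not align.
overlap = total.dropna()

# add() with fill_value=0 treats a label missing on one side as 0.
filled = s.iloc[1:].add(s.iloc[:-1], fill_value=0)

print(total)
print(overlap)
print(filled)
```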

### Series name attribute

In [44]:
s = pd.Series(np.random.randn(5), name="something")

In [45]:
s.name

'something'
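A Series can also be given a new name with rename(), which returns a new object instead of modifying the original. A brief sketch (`s2` is an illustrative name):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(5), name="something")

# rename() with a scalar returns a new Series carrying the new name;
# the original Series keeps its name.
s2 = s.rename("different")

print(s.name, s2.name)
```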

## DataFrame

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input:

- Dict of 1D ndarrays, lists, dicts, or Series
- 2-D numpy.ndarray
- Structured or record ndarray
- A Series
- Another DataFrame
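Two of the input kinds in the list above, the 2-D ndarray and the structured ndarray, can be sketched as follows. The array contents and the names `df_2d` and `df_rec` are made up for illustration.

```python
import numpy as np
import pandas as pd

# 2-D ndarray: rows get a default integer index; column labels are given here.
arr = np.arange(6).reshape(2, 3)
df_2d = pd.DataFrame(arr, columns=["x", "y", "z"])

# Structured ndarray: the field names become the column names.
records = np.zeros(2, dtype=[("A", "i4"), ("B", "f4")])
records[:] = [(1, 2.0), (2, 3.0)]
df_rec = pd.DataFrame(records)

print(df_2d)
print(df_rec)
```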

### From dict of Series or dicts

In [46]:
d = {
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
}

In [47]:
df = pd.DataFrame(d)

In [48]:
df

   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0


You can specify the index (row labels):

In [49]:
pd.DataFrame(d, index=["d", "b", "a"])

   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0


You can also specify the columns (column labels). A requested column that is missing from the dict, such as "three", is filled with NaN:

In [51]:
pd.DataFrame(d, index=["d", "b", "a"], columns=["two", "three"])

   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN


In [52]:
df.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [53]:
df.columns

Index(['one', 'two'], dtype='object')

### From dict of ndarrays / lists

In [54]:
d = {"one": [1.0, 2.0, 3.0, 4.0], "two": [4.0, 3.0, 2.0, 1.0]}

If no index is passed, the result will be range(n), where n is the array length:

In [55]:
pd.DataFrame(d)

   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0


In [56]:
pd.DataFrame(d, index=["a", "b", "c", "d"])

   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0


### From a list of dicts

In [63]:
data2 = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]

In [64]:
pd.DataFrame(data2)

   a   b     c
0  1   2   NaN
1  5  10  20.0


In [65]:
pd.DataFrame(data2, index=["first", "second"])

        a   b     c
first   1   2   NaN
second  5  10  20.0


In [66]:
pd.DataFrame(data2, columns=["a", "b"])

   a   b
0  1   2
1  5  10


### From a dict of tuples

In [67]:
pd.DataFrame(
    {
        ("a", "b"): {("A", "B"): 1, ("A", "C"): 2},
        ("a", "a"): {("A", "C"): 3, ("A", "B"): 4},
        ("a", "c"): {("A", "B"): 5, ("A", "C"): 6},
        ("b", "a"): {("A", "C"): 7, ("A", "B"): 8},
        ("b", "b"): {("A", "D"): 9, ("A", "B"): 10},
    }
)

       a              b
       b    a    c    a     b
A B  1.0  4.0  5.0  8.0  10.0
  C  2.0  3.0  6.0  7.0   NaN
  D  NaN  NaN  NaN  NaN   9.0


### From a Series

In [69]:
ser = pd.Series(range(3), index=list("abc"), name="ser")
ser

a    0
b    1
c    2
Name: ser, dtype: int64

In [70]:
pd.DataFrame(ser)

   ser
a    0
b    1
c    2
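The column label comes from the Series name. As a hedged sketch of two related cases: `to_frame()` does the same conversion and can override the label, and a Series without a name ends up with the default column label 0. The names `renamed` and `unnamed` are illustrative.

```python
import pandas as pd

ser = pd.Series(range(3), index=list("abc"), name="ser")

# to_frame() converts the Series to a one-column DataFrame;
# name= overrides the column label.
renamed = ser.to_frame(name="values")

# A Series without a name gets the default column label 0.
unnamed = pd.DataFrame(pd.Series(range(3), index=list("abc")))

print(renamed)
print(unnamed)
```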


### Column Selection, addition, deletion

In [71]:
df["one"]

a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

In [72]:
df["three"] = df["one"] * df["two"]
df

   one  two  three
a  1.0  1.0    1.0
b  2.0  2.0    4.0
c  3.0  3.0    9.0
d  NaN  4.0    NaN


In [73]:
df["flag"] = df["one"] > 2
df

   one  two  three   flag
a  1.0  1.0    1.0  False
b  2.0  2.0    4.0  False
c  3.0  3.0    9.0   True
d  NaN  4.0    NaN  False


In [74]:
del df["two"]
df

   one  three   flag
a  1.0    1.0  False
b  2.0    4.0  False
c  3.0    9.0   True
d  NaN    NaN  False


In [75]:
three = df.pop("three")

In [76]:
three

a    1.0
b    4.0
c    9.0
d    NaN
Name: three, dtype: float64

In [77]:
type(three)

pandas.core.series.Series

In [78]:
df

   one   flag
a  1.0  False
b  2.0  False
c  3.0   True
d  NaN  False
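Both `del` and `pop()` modify the DataFrame in place. When the original should stay intact, `drop()` is one alternative that returns a new object. The small frame and the name `without_two` below are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [4.0, 5.0, 6.0]}, index=["a", "b", "c"]
)

# drop() returns a new DataFrame without the listed columns;
# df itself still has both columns afterwards.
without_two = df.drop(columns=["two"])

print(without_two)
print(df.columns.tolist())
```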


In [79]:
df["foo"] = "bar"

When a scalar value is assigned to a column, it is naturally propagated to fill the column:

In [80]:
df

   one   flag  foo
a  1.0  False  bar
b  2.0  False  bar
c  3.0   True  bar
d  NaN  False  bar


In [81]:
df["one_trunc"] = df["one"][:2]

In [82]:
df

   one   flag  foo  one_trunc
a  1.0  False  bar        1.0
b  2.0  False  bar        2.0
c  3.0   True  bar        NaN
d  NaN  False  bar        NaN


By default, new columns are added at the end. Use insert() to place a column at a specific position:

In [84]:
df.insert(1, "bar", df["one"])   # insert "bar" at column position 1

In [85]:
df

   one  bar   flag  foo  one_trunc
a  1.0  1.0  False  bar        1.0
b  2.0  2.0  False  bar        2.0
c  3.0  3.0   True  bar        NaN
d  NaN  NaN  False  bar        NaN
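One detail worth knowing about `insert()` is that, by default, it refuses to add a column label that already exists. A minimal sketch on a small made-up frame (`duplicate_error` is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0], "flag": [False, True]})

# Insert a copy of "one" at position 1, between "one" and "flag".
df.insert(1, "bar", df["one"])

# Inserting the same label a second time raises ValueError by default.
try:
    df.insert(0, "bar", df["one"])
    duplicate_error = None
except ValueError as exc:
    duplicate_error = exc

print(df.columns.tolist())
print(type(duplicate_error).__name__)
```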


### Indexing / selection

df[col] returns a Series (select column)

df.loc[label] returns a Series (select row by label)

df.iloc[loc] returns a Series (select row by integer location)

df[5:10] returns a DataFrame (slice rows)
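The selection forms listed above can be tried on a small frame to see which type each one returns. The data and variable names here are illustrative, and the last line adds boolean-mask selection as a related case.

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0, 4.0], "two": [4.0, 3.0, 2.0, 1.0]},
    index=["a", "b", "c", "d"],
)

col = df["one"]               # column by label -> Series
row_by_label = df.loc["b"]    # row by label -> Series indexed by column names
row_by_pos = df.iloc[1]       # row by integer position -> Series
rows = df[1:3]                # row slice -> DataFrame
masked = df[df["one"] > 2]    # boolean vector -> DataFrame

for obj in (col, row_by_label, row_by_pos, rows, masked):
    print(type(obj).__name__)
```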