# ⛳ [Introduction to data structures](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#dsintro)

We'll start with a quick, non-comprehensive overview of the fundamental data structures in pandas to get you started. The fundamental behavior about data types, indexing, and axis labeling / alignment applies across all of the objects. To get started, import NumPy and load pandas into your namespace:

In [1]:
import numpy as np
import pandas as pd

Here is a basic tenet to keep in mind: **data alignment is intrinsic.** The link between labels and data will not be broken unless done so explicitly by you.

We'll give a brief intro to the data structures, then consider all of the broad categories of functionality and methods in separate sections.

# Series
`Series` is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the **index**. The basic method to create a Series is to call:

```
s = pd.Series(data, index=index)
```

Here, `data` can be many different things:
- a Python dict
- an ndarray
- a scalar value (like 5)

The passed **index** is a list of axis labels. Thus, this separates into a few cases depending on what **data** is:

### From ndarray

If `data` is an ndarray, **index** must be the same length as **data**. If no index is passed, one will be created having values `[0, ..., len(data) - 1]`.

In [2]:
s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
s

a   -1.356653
b    0.827809
c    1.410806
d    0.479457
e    1.486072
dtype: float64

In [3]:
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [4]:
pd.Series(np.random.randn(5))

0    0.075147
1    1.433913
2   -0.691085
3    1.690981
4    0.590837
dtype: float64
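
The length requirement for an ndarray can be checked directly. The sketch below is only an illustration: the array `values` and its labels are made up for this example, and the exact wording of the error message may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three values but only two labels.
values = np.array([1.0, 2.0, 3.0])

try:
    pd.Series(values, index=["a", "b"])
except ValueError as exc:
    # pandas refuses to build the Series when the lengths disagree.
    print("length mismatch:", exc)

# With matching lengths the construction succeeds.
ok = pd.Series(values, index=["a", "b", "c"])
print(ok)
```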

> ## ✨ Note
pandas supports non-unique index values. If an operation that does not support duplicate index values is attempted, an exception will be raised at that time. The reason for being lazy is nearly all performance-based (there are many instances in computations, like parts of GroupBy, where the index is not used).  
Why non-unique index values are allowed: there are often cases, such as parts of groupby, where the index is never used.
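
A minimal sketch of this lazy behavior, assuming a made-up Series `dup` with a repeated label. Lookups and arithmetic work, and the error only appears once an operation that needs unique labels (here `reindex`) is attempted. The exact error text depends on the pandas version.

```python
import pandas as pd

# Hypothetical Series whose index contains the label "a" twice.
dup = pd.Series([1, 2, 3], index=["a", "a", "b"])

# Label lookup returns every row that matches the label.
print(dup["a"])

# Element-wise arithmetic does not need unique labels.
print(dup * 2)

# Reindexing does need unique labels, so the error surfaces only here.
try:
    dup.reindex(["a", "b", "c"])
except ValueError as exc:
    print("reindex failed:", exc)
```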

### From dict
Series can be instantiated from dicts:

In [5]:
d = {"b": 1, "a": 0, "c": 2}
d

{'a': 0, 'b': 1, 'c': 2}

In [6]:
pd.Series(d)

b    1
a    0
c    2
dtype: int64

> ## ✨ Note
When the data is a dict, and an index is not passed, the `Series` index will be ordered by the dict's insertion order, if you're using Python version >= 3.6 and pandas version >= 0.23.  
>
> If you're using Python < 3.6 or pandas < 0.23, and an index is not passed, the `Series` index will be the lexically ordered (that is, dictionary-ordered) list of dict keys.

In the example above, if you were on a Python version lower than 3.6 or a pandas version lower than 0.23, the `Series` would be ordered by the lexical order of the dict keys (i.e. `['a', 'b', 'c']` rather than `['b', 'a', 'c']`). In other words, alphabetical order.

If an index is passed, the values in data corresponding to the labels in the index will be pulled out.


In [7]:
d = {"a": 0.0, "b": 1.0, "c": 2.0}

In [8]:
pd.Series(d)

a    0.0
b    1.0
c    2.0
dtype: float64

In [9]:
pd.Series(d, index=["b", "c", "d", "a"])

b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

> ## ✨ Note
NaN (not a number) is the standard missing data marker used in pandas.

### From scalar value
If `data` is a scalar value, an index must be provided. The value will be repeated to match the length of **index**.


In [10]:
pd.Series(5.0, index=["a", "b", "c", "d", "e"])

a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

# Series is ndarray-like
`Series` acts very similarly to an `ndarray`, and is a valid argument to most NumPy functions. However, operations such as slicing will also slice the index.

In [11]:
s[0]

-1.35665336174225

In [12]:
s[:3]

a   -1.356653
b    0.827809
c    1.410806
dtype: float64

In [13]:
s[s > s.median()]

c    1.410806
e    1.486072
dtype: float64

In [14]:
s[[4, 3, 1]]

e    1.486072
d    0.479457
b    0.827809
dtype: float64

In [15]:
np.exp(s)  # Element-wise exponential: returns e**x for every element x of s

a    0.257521
b    2.288300
c    4.099259
d    1.615197
e    4.419702
dtype: float64

> ## ✨ Note
We will address array-based indexing like `s[[4, 3, 1]]` in the section on indexing.
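
As a side note, recent pandas releases warn that integer keys such as `s[0]` or `s[[4, 3, 1]]` on a label-indexed Series will eventually be treated as labels rather than positions. A minimal sketch of the explicit position-based form with `.iloc`, using a made-up Series `t` so the values are predictable:

```python
import pandas as pd

# Hypothetical Series with string labels.
t = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0], index=["a", "b", "c", "d", "e"])

first = t.iloc[0]                   # scalar at position 0
head = t.iloc[:3]                   # first three rows; labels a, b, c are kept
picked = t.iloc[[4, 3, 1]]          # rows at positions 4, 3, 1, in that order
by_label = t.loc[["e", "d", "b"]]   # the same rows, selected by label

print(picked)
```

`.iloc` always means "by position" and `.loc` always means "by label", so the intent stays clear regardless of the index type.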

Like a NumPy array, a pandas Series has a `dtype`.

In [16]:
s.dtype

dtype('float64')

This is often a NumPy dtype. However, pandas and 3rd-party libraries extend NumPy's type system in a few places, in which case the dtype would be an `ExtensionDtype`. Some examples within pandas are [Categorical data](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html#categorical) and [Nullable integer data type](https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html#integer-na). See [dtypes](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#basics-dtypes) for more.
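
A short sketch of the two extension dtypes named above. The Series `ints` and `cats` are made up for illustration.

```python
import pandas as pd

# Nullable integer dtype: note the capital "I" in "Int64".
ints = pd.Series([1, 2, None], dtype="Int64")

# Categorical dtype.
cats = pd.Series(["low", "high", "low"], dtype="category")

print(ints.dtype)  # Int64
print(cats.dtype)  # category

# Both dtypes derive from pandas' ExtensionDtype base class.
ExtensionDtype = pd.api.extensions.ExtensionDtype
print(isinstance(ints.dtype, ExtensionDtype))
print(isinstance(cats.dtype, ExtensionDtype))
```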

If you need the actual array backing a `Series` (that is, to pull the Series out as an array), use [Series.array](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.array.html#pandas.Series.array).

In [17]:
s

a   -1.356653
b    0.827809
c    1.410806
d    0.479457
e    1.486072
dtype: float64

In [18]:
s.array

<PandasArray>
[ -1.35665336174225, 0.8278090420890765, 1.4108062134278982,
 0.4794567761950909, 1.4860723556951136]
Length: 5, dtype: float64

Accessing the array can be useful when you need to do some operation without the index (to disable [automatic alignment](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#dsintro-alignment), for example).
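
A minimal sketch of what "without the index" means in practice, assuming a made-up Series `t`. Adding two slices as Series aligns them by label, while adding their backing arrays works position by position.

```python
import pandas as pd

# Hypothetical Series with string labels.
t = pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"])

# Label-aligned addition: "a" and "d" exist on one side only, so they become NaN.
aligned = t.iloc[1:] + t.iloc[:-1]

# Position-wise addition: the backing arrays carry no labels to align on.
positional = t.iloc[1:].array + t.iloc[:-1].array

print(aligned)
print(positional)
```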

`Series.array` will always be an `ExtensionArray`. Briefly, an `ExtensionArray` is a thin wrapper around one or more concrete arrays like a `numpy.ndarray`. pandas knows how to take an `ExtensionArray` and store it in a `Series` or a column of a `DataFrame`.

While Series is ndarray-like, if you need an actual ndarray, then use [Series.to_numpy()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.to_numpy.html#pandas.Series.to_numpy).

In [19]:
s.to_numpy()  # A NumPy ndarray representing the values in this Series or Index.

array([-1.35665336,  0.82780904,  1.41080621,  0.47945678,  1.48607236])

Even if the Series is backed by an `ExtensionArray`, `Series.to_numpy()` will return a NumPy ndarray.
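
A small sketch of that last point, using made-up Series backed by the nullable integer extension type. The `dtype` and `na_value` arguments of `to_numpy()` are one way to decide how missing values are represented; the default result dtype may vary across pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical Series backed by the nullable-integer extension array.
ext = pd.Series([1, 2, 3], dtype="Int64")

arr = ext.to_numpy()
print(type(ext.array).__name__)  # the extension array class
print(type(arr).__name__)        # ndarray

# With missing values, choose the NumPy dtype and the fill value explicitly.
with_na = pd.Series([1, None, 3], dtype="Int64")
filled = with_na.to_numpy(dtype="float64", na_value=np.nan)
print(filled)
```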

# Series is dict-like
A Series is like a fixed-size dict in that you can get and set values by index label:

In [20]:
s["a"]

-1.35665336174225

In [21]:
s["e"]

1.4860723556951136

In [22]:
s["e"] = 12.0
s

a    -1.356653
b     0.827809
c     1.410806
d     0.479457
e    12.000000
dtype: float64

In [23]:
"e" in s

True

In [24]:
"f" in s

False

If a label is not contained, an exception is raised:

In [26]:
s["f"]

KeyError: 'f'

Using the `get` method, a missing label returns `None` or the specified default:

In [27]:
s.get("f")

In [28]:
s.get("f", np.nan)

nan

See also the [section on attribute access](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-attribute-access).

In [29]:
sa = pd.Series([1, 2, 3], index=list('abc'))
sa.b

2

In [30]:
dates = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])
df 

                   A         B         C         D
2000-01-01 -0.923331  1.123154 -0.532580 -0.951726
2000-01-02  0.751402 -0.335869  0.161756  1.545172
2000-01-03 -1.116099 -0.881140  0.750553  0.622770
2000-01-04 -0.539731 -1.010538  0.839610  0.623270
2000-01-05 -1.219178  0.154301  2.465859 -2.084617
2000-01-06 -0.896576 -1.534233 -0.402971 -0.733845
2000-01-07  1.212667 -0.795310  0.572087  1.645725
2000-01-08 -1.704699  0.149546  1.117729  0.073846


In [31]:
dfa = df.copy()
dfa.A

2000-01-01   -0.923331
2000-01-02    0.751402
2000-01-03   -1.116099
2000-01-04   -0.539731
2000-01-05   -1.219178
2000-01-06   -0.896576
2000-01-07    1.212667
2000-01-08   -1.704699
Freq: D, Name: A, dtype: float64

In [32]:
dfa.A = list(range(len(dfa.index)))
dfa.A

2000-01-01    0
2000-01-02    1
2000-01-03    2
2000-01-04    3
2000-01-05    4
2000-01-06    5
2000-01-07    6
2000-01-08    7
Freq: D, Name: A, dtype: int64

# Vectorized operations and label alignment with Series
When working with raw NumPy arrays, looping through value-by-value is usually not necessary. The same is true when working with Series in pandas. Series can also be passed into most NumPy methods expecting an ndarray. 

In [33]:
s + s

a    -2.713307
b     1.655618
c     2.821612
d     0.958914
e    24.000000
dtype: float64

In [34]:
s * 2

a    -2.713307
b     1.655618
c     2.821612
d     0.958914
e    24.000000
dtype: float64

In [35]:
np.exp(s)

a         0.257521
b         2.288300
c         4.099259
d         1.615197
e    162754.791419
dtype: float64

A key difference between Series and ndarray is that operations between Series automatically align the data based on label. Thus, you can write computations without giving consideration to whether the Series involved have the same labels.

In [36]:
s[1:] + s[:-1]

a         NaN
b    1.655618
c    2.821612
d    0.958914
e         NaN
dtype: float64

The result of an operation between unaligned Series will have the **union** of the indexes involved. If a label is not found in one Series or the other, the result will be marked as missing `NaN`. Being able to write code without doing any explicit data alignment grants immense freedom and flexibility in interactive data analysis and research. The integrated data alignment features of the pandas data structures set pandas apart from the majority of related tools for working with labeled data. 

> ## ✨ Note
In general, we chose to make the default result of operations between differently indexed objects yield the **union** of the indexes in order to avoid loss of information. Having an index label, though the data is missing, is typically important information as part of a computation. You of course have the option of dropping labels with missing data via the **dropna** function.
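
A minimal sketch of the union behavior and of `dropna`, assuming two made-up Series `x` and `y` that share only some labels. `Series.add` with `fill_value` is shown as one alternative when a missing partner should count as zero.

```python
import pandas as pd

# Hypothetical Series that overlap only on labels "b" and "c".
x = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
y = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# The result index is the union a, b, c, d; "a" and "d" have no partner, so NaN.
total = x + y
print(total)

# Drop the labels whose result is missing.
print(total.dropna())

# Alternative: treat a missing partner as 0 instead of producing NaN.
print(x.add(y, fill_value=0))
```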

# Name attribute
Series can also have a `name` attribute:

In [37]:
s = pd.Series(np.random.randn(5), name="something")
s

0    0.357403
1   -0.863760
2    1.766337
3   -0.491810
4    2.241597
Name: something, dtype: float64

In [38]:
s.name

'something'

The Series `name` will be assigned automatically in many cases, in particular when taking 1D slices of a DataFrame, as you will see below.

You can rename a Series with the [pandas.Series.rename()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.rename.html#pandas.Series.rename) method.

In [39]:
s2 = s.rename("name changed")
s2.name

'name changed'

Note that `s` and `s2` refer to different objects.
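
A quick way to confirm this, sketched with made-up names `orig` and `renamed` so the notebook's `s` and `s2` are left untouched:

```python
import pandas as pd

# Hypothetical Series used only for this check.
orig = pd.Series([1.0, 2.0, 3.0], name="something")
renamed = orig.rename("name changed")

print(renamed is orig)  # False: rename returned a new Series object
print(orig.name)        # something
print(renamed.name)     # name changed
```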

# DataFrame
**DataFrame** is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input:
- Dict of 1D ndarrays, lists, dicts, or Series
- 2-D numpy.ndarray
- Structured or record ndarray
- A `Series`
- Another `DataFrame`

Along with the data, you can optionally pass **index** (row labels) and **columns** (column labels) arguments. If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching up to the passed index.
If axis labels are not passed, they will be constructed from the input data based on common sense rules. 
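
A brief sketch of a few of the input kinds listed above and of how labels are filled in when they are not passed. All names and values here are made up for illustration.

```python
import numpy as np
import pandas as pd

# 2-D ndarray: row labels default to 0..n-1; column labels are passed explicitly.
arr_df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["x", "y"])

# A named Series becomes a one-column DataFrame that keeps the Series index.
ser_df = pd.DataFrame(pd.Series([1, 2, 3], index=["a", "b", "c"], name="val"))

# Dict of lists: the keys become the column labels.
dict_df = pd.DataFrame({"one": [1, 2], "two": [3, 4]})

print(arr_df)
print(ser_df)
print(dict_df)
```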

> ## ✨ Note
When the data is a dict, and `columns` is not specified, the `DataFrame` columns will be ordered by the dict's insertion order, if you are using Python version >= 3.6 and pandas >= 0.23.
If you are using Python < 3.6 or pandas < 0.23, and `columns` is not specified, the `DataFrame` columns will be the lexically ordered list of dict keys.

### From dict of Series or dicts
The resulting **index** will be the **union** of the indexes of the various Series. If there are any nested dicts, these will first be converted to Series. If no columns are passed, the columns will be the ordered list of dict keys. 


In [40]:
d = {
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
}

In [41]:
df = pd.DataFrame(d)
df

   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0


In [42]:
pd.DataFrame(d, index=["d", "b", "a"])

   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0


In [44]:
pd.DataFrame(d, index=["d", "b", "a"], columns=["two", "three"])

   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN
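
The nested-dict case mentioned at the start of this section behaves the same way as a dict of Series. A minimal sketch with a made-up dict `nested`: the inner dicts are treated like Series, and the row index is the union of their keys.

```python
import pandas as pd

# Hypothetical nested dict: outer keys become columns, inner keys become row labels.
nested = {
    "one": {"a": 1.0, "b": 2.0},
    "two": {"a": 10.0, "b": 20.0, "c": 30.0},
}

df_nested = pd.DataFrame(nested)
print(df_nested)  # row "c" has no entry under "one", so that cell is NaN
```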


> ## ✨ Note
When a particular set of columns is passed along with a dict of data, the passed columns override the keys in the dict.

The row and column labels can be accessed through the **index** and **columns** attributes, respectively:

In [45]:
df

   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0


In [46]:
df.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [47]:
df.columns

Index(['one', 'two'], dtype='object')