### Notes on Ch 5: Getting Started with pandas

pandas is a fundamental tool for data analysis in Python. It offers data structures and manipulation tools for efficient data cleaning and analysis. It incorporates NumPy's array-based computing style but focuses on working with tabular or heterogeneous data. In contrast, NumPy is best suited for homogeneous numerical arrays.

In [1]:
import numpy as np
import pandas as pd

#### Introduction to pandas Data Structures

##### Series

A Series is a one-dimensional array-like object that holds a sequence of values (of types similar to NumPy types) and an associated array of labels, called its index.

In [2]:
obj = pd.Series([4, 7, -5, 3])
print(obj)

0    4
1    7
2   -5
3    3
dtype: int64


You can create a Series with a custom index to label each data point:

In [3]:
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
print(obj2)

d    4
b    7
a   -5
c    3
dtype: int64


You can use labels in the index to select values:

In [4]:
print(obj2['a'])

-5


A Series can be seen as a fixed-length, ordered dictionary, where index values map to data values. You can use it like a dictionary:

In [6]:
"b" in obj2

True

You can create a Series from a Python dictionary:

In [8]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
print(obj3)

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64


* You can convert a Series back to a dictionary using `obj3.to_dict()`.
* The order of keys in the dictionary determines the order in the resulting Series. You can override this by passing an index explicitly.

<b>Handling Missing Data</b>

Missing data in pandas is represented as `NaN` (Not a Number). You can detect missing data using `pd.isna(obj)` or `pd.notna(obj)`. Series also has the equivalent instance methods `obj.isna()` and `obj.notna()`.
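
A minimal sketch of these checks, assuming a small hypothetical Series `s` that exists only for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one value is deliberately missing.
s = pd.Series([1.5, np.nan, 3.0], index=["a", "b", "c"])

# The top-level functions return a Boolean Series aligned with the input.
missing_mask = pd.isna(s)
present_mask = pd.notna(s)

# The instance methods give the same result.
same_as_methods = missing_mask.equals(s.isna()) and present_mask.equals(s.notna())

print(missing_mask)
```

The Boolean result can be used directly as a filter, for example `s[present_mask]`.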

<b>Automatic Alignment</b>

Series automatically aligns data by index label in arithmetic operations. This alignment is similar to a database join operation.

In [10]:
states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(sdata, index=states)

result = obj3 + obj4
print(result)

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64


Both the Series object and its index can have names. A Series's index can also be replaced in place by assignment.

In [11]:
obj4.name = "population"
obj4.index.name = "state"
print(obj4)

obj.index = ["Bob", "Steve", "Jeff", "Ryan"]
print(obj)

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64


##### DataFrame

A DataFrame is a two-dimensional tabular data structure in pandas, similar to a spreadsheet or SQL table. It consists of rows and columns, where each column can have a different data type (e.g., numeric, string, Boolean). We can think of it as a collection of Series objects, with each column being a Series.

In [3]:
data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002, 2003],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]
}

frame = pd.DataFrame(data)
frame

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2


DataFrames have both a row index and a column index. Rows receive a default integer index, and columns are named after the keys of the dictionary. The `head()` and `tail()` methods display the first and last five rows, respectively. Columns can be accessed using dictionary-like notation or dot notation (dot notation works only when the column name is a valid Python identifier and does not clash with a DataFrame method name). New columns can be added by assignment.

In [4]:
# First and last 5 rows
print(frame.head())
print(frame.tail())

# Accessing columns
print(frame["state"])
print(frame.year)

# Creating new columns

frame["debt"] = 16.5
frame["debt"] = np.arange(6.)
frame

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
    state  year  pop
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2
0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
5    Nevada
Name: state, dtype: object
0    2000
1    2001
2    2002
3    2001
4    2002
5    2003
Name: year, dtype: int64


    state  year  pop  debt
0    Ohio  2000  1.5   0.0
1    Ohio  2001  1.7   1.0
2    Ohio  2002  3.6   2.0
3  Nevada  2001  2.4   3.0
4  Nevada  2002  2.9   4.0
5  Nevada  2003  3.2   5.0


If you assign a Series to a DataFrame column, its labels are aligned with the DataFrame's index, and any index values not present in the Series receive `NaN`. Columns can be deleted using the `del` keyword:

In [5]:
del frame["debt"]
frame

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2
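
The index alignment that happens when a Series is assigned to a column can be sketched as follows. The frame `df` and the Series `values` are hypothetical and use string row labels so that the alignment is easy to see:

```python
import pandas as pd

# Hypothetical frame with a three-label row index.
df = pd.DataFrame({"year": [2000, 2001, 2002]}, index=["one", "two", "three"])

# The Series covers only two of the labels, in a different order.
values = pd.Series([-1.5, -1.2], index=["three", "one"])

# Assignment aligns on index labels; "two" has no match and becomes NaN.
df["debt"] = values
print(df)
```

Values are matched by label, not by position, so the order of `values` does not matter.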


Rows can also be retrieved by position or by label with the special `iloc` and `loc` attributes:

In [9]:
print(frame.loc[2])

print(frame.iloc[1])

state    Ohio
year     2002
pop       3.6
Name: 2, dtype: object
state    Ohio
year     2001
pop       1.7
Name: 1, dtype: object


DataFrames can be transposed to swap rows and columns:

In [10]:
frame.T

          0     1     2       3       4       5
state  Ohio  Ohio  Ohio  Nevada  Nevada  Nevada
year   2000  2001  2002    2001    2002    2003
pop     1.5   1.7   3.6     2.4     2.9     3.2


We can convert a DataFrame to a NumPy array using the to_numpy() method:

In [11]:
frame.to_numpy()

array([['Ohio', 2000, 1.5],
       ['Ohio', 2001, 1.7],
       ['Ohio', 2002, 3.6],
       ['Nevada', 2001, 2.4],
       ['Nevada', 2002, 2.9],
       ['Nevada', 2003, 3.2]], dtype=object)
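
The `dtype=object` in the output above appears because the columns have no common numeric type. A short sketch with two hypothetical frames, `mixed` and `numeric`, shows the difference:

```python
import numpy as np
import pandas as pd

mixed = pd.DataFrame({"state": ["Ohio", "Nevada"], "pop": [1.5, 2.4]})
numeric = pd.DataFrame({"year": [2000, 2001], "pop": [1.5, 2.4]})

# Strings and floats share no numeric type, so the array holds Python objects.
mixed_dtype = mixed.to_numpy().dtype

# Integers and floats are upcast to a single float64 array.
numeric_dtype = numeric.to_numpy().dtype

print(mixed_dtype, numeric_dtype)
```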

##### Index Objects

Index objects in pandas are responsible for holding the axis labels (including column names in DataFrames) and other metadata like the axis name or names. Any array or sequence of labels used when creating a Series or DataFrame is internally converted to an Index.

In [2]:
obj = pd.Series(range(3), index=["a", "b", "c"])
index = obj.index
print(index[1:])

Index(['b', 'c'], dtype='object')


Index objects are immutable, meaning they cannot be modified by the user. For example, you cannot change an element of the index.
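
A small sketch of what immutability means in practice, using a hypothetical Series `s`. Assigning to one element of the index fails, while replacing the whole index on the owning object is allowed:

```python
import pandas as pd

s = pd.Series(range(3), index=["a", "b", "c"])
index = s.index

# Item assignment on an Index raises TypeError.
try:
    index[1] = "d"
    mutated = True
except TypeError as exc:
    mutated = False
    print(f"TypeError: {exc}")

# Replacing the entire index of the Series is a different operation and works.
s.index = ["x", "y", "z"]
print(s.index)
```

Immutability is what makes it safe to share one Index object among several data structures.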

#### Essential Functionality

##### Reindexing

In pandas, the `reindex` method is used to create a new object with values rearranged to match a new index. This can be applied to both Series and DataFrames.

In [3]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])
print(obj)

obj2 = obj.reindex(["a", "b", "c", "d", "e"])
print(obj2)

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64


The reindex method rearranges data based on the new index, introducing missing values for absent index values.

For ordered data like time series, you can use the method option to specify an interpolation method, e.g., "ffill" for forward-filling.
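
As a sketch of forward-filling, assume a hypothetical Series `colors` whose integer index has gaps:

```python
import pandas as pd

# Hypothetical ordered data with gaps in the integer index.
colors = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])

# Forward-fill carries the last valid observation into each new label.
filled = colors.reindex(list(range(6)), method="ffill")
print(filled)
```

Labels 1, 3, and 5 do not exist in `colors`, so each takes the value of the label before it.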

In DataFrames, you can reindex rows, columns, or both.

In [4]:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)), index=["a", "c", "d"], columns=["Ohio", "Texas", "California"])
frame2 = frame.reindex(index=["a", "b", "c", "d"])
frame.reindex(columns=["Texas", "Utah", "California"])

   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8


When reindexing columns, any columns not present in the new index are dropped from the result.

You can reindex by specifying the axis using the axis keyword.

The method, fill_value, limit, tolerance, level, and copy arguments offer various customization options for reindexing.

For more precise reindexing and data selection, you can use the loc operator, but this method requires that all new index labels already exist in the DataFrame.
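
The difference between `reindex` and `loc` can be sketched as follows. The frame `df` mirrors the example above, and the variable names are chosen only for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(9).reshape((3, 3)),
    index=["a", "c", "d"],
    columns=["Ohio", "Texas", "California"],
)

# reindex accepts unknown labels; fill_value replaces the default NaN.
with_zeros = df.reindex(index=["a", "b", "c"], fill_value=0)

# loc reorders and selects, but only among labels that already exist.
subset = df.loc[["d", "a"], ["California", "Texas"]]

# A label that is absent from the index makes loc raise KeyError.
try:
    df.loc[["a", "b"]]
    raised = False
except KeyError:
    raised = True

print(with_zeros)
print(subset)
```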

##### Dropping Entries from an Axis

You can drop one or more entries from a Series using the `drop` method.

In [5]:
obj = pd.Series(np.arange(5.), index=["a", "b", "c", "d", "e"])
print(obj)

new_obj = obj.drop("c")
print(new_obj)

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64


In a DataFrame, you can delete index values (rows) or column labels. Passing labels to `drop` removes rows by default, which is equivalent to using the `index` keyword:

In [6]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)), index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])
print(data)

data.drop(["Colorado", "Ohio"])

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15


To drop columns, use the columns keyword or specify axis=1:

In [7]:
data.drop("two", axis=1)

          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15
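
The `columns` keyword form can be sketched like this. The frame `table` is a hypothetical copy of the data above:

```python
import numpy as np
import pandas as pd

table = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

# columns= names the axis explicitly, so axis=1 is not needed.
by_keyword = table.drop(columns=["two", "four"])
by_axis = table.drop(["two", "four"], axis=1)

# drop returns a new object; the original keeps all four columns.
print(by_keyword)
print(list(table.columns))
```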


##### Indexing, Selection, and Filtering

Series indexing works like NumPy array indexing but allows using index labels instead of only integers. You can select elements by label or position.

In [8]:
obj = pd.Series(np.arange(4.), index=["a", "b", "c", "d"])

print(obj["b"])
print(obj[1])
print(obj[2:4])

1.0
1.0
c    2.0
d    3.0
dtype: float64


The `loc` operator is preferred when selecting data by label since it explicitly uses labels, avoiding ambiguity.

In [9]:
obj.loc["b"]

1.0

Use `iloc` for integer-based indexing to work consistently, regardless of the index type.

In [11]:
obj1 = pd.Series([1, 2, 3], index=[2, 0, 1])
obj1.iloc[[0, 1, 2]]


2    1
0    2
1    3
dtype: int64

Slicing with labels is inclusive on both ends:

In [12]:
obj2.loc["b":"c"]

b    7.2
c    3.6
dtype: float64
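
To contrast label slicing with ordinary positional slicing, here is a brief sketch on a hypothetical Series `s`:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(4.0), index=["a", "b", "c", "d"])

# Label slice: both endpoints "b" and "c" are included.
by_label = s.loc["b":"c"]

# Position slice: the stop position 2 is excluded, as with Python lists.
by_position = s.iloc[1:2]

print(by_label)
print(by_position)
```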

You can assign values using these methods to modify the Series:

In [15]:
obj2.loc["b":"c"] = 5
print(obj2)

a   -5.3
b    5.0
c    5.0
d    4.5
e    NaN
dtype: float64


Select a single column using DataFrame indexing:

In [16]:
data["two"]

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64

Select multiple columns by providing a list:

In [17]:
data[["three", "one"]]

          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12


Use Boolean arrays for data selection:

In [18]:
data[data["three"] > 5]

          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


<b>Selection on DataFrame with loc and iloc</b>

`loc` and `iloc` can also be used in DataFrames for indexing.

In [23]:
# Selecting a single row by label returns a Series
print(data.loc["Utah"])
print("")
# Selecting multiple rows with a list of labels returns a DataFrame
print(data.loc[["Utah", "Colorado"]])
print("")
# Combining row and column selections
print(data.loc["Utah", ["two", "three"]])

one       8
two       9
three    10
four     11
Name: Utah, dtype: int64

          one  two  three  four
Utah        8    9     10    11
Colorado    4    5      6     7

two       9
three    10
Name: Utah, dtype: int64


In [24]:
# Similarly, we can use iloc
print(data.iloc[2])
print("")
print(data.iloc[[2, 0, 1]])
print("")
print(data.iloc[2, [3, 0, 1]])

one       8
two       9
three    10
four     11
Name: Utah, dtype: int64

          one  two  three  four
Utah        8    9     10    11
Ohio        0    1      2     3
Colorado    4    5      6     7

four    11
one      8
two      9
Name: Utah, dtype: int64


##### Pitfalls with chained indexing

You can modify DataFrame data by assigning values using labels or integer positions.

In [25]:
data.loc[:, "one"] = 1
data.iloc[2] = 5
data.loc[data["four"] > 5] = 3

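Avoid chained indexing when doing assignments: it is ambiguous whether the intermediate result is a view or a copy, so the assignment may never reach the original data. Chained indexing might trigger a `SettingWithCopyWarning`. Prefer a single `loc` operation instead: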

In [26]:
data.loc[data.three == 5, "three"] = 6
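
As a sketch of the pitfall, assume a hypothetical frame `df` with the same layout as `data`. The chained form is shown only in a comment, since whether it modifies `df` depends on the pandas version and settings:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

# Chained form (avoid): df[df["three"] > 5]["three"] = 0
# The first [] may return a copy, so the assignment can be silently lost.

# Single loc call: the row mask and column label are resolved together.
df.loc[df["three"] > 5, "three"] = 0
print(df)
```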

#### Arithmetic and Data Alignment

When performing arithmetic operations between Series objects, pandas aligns the data based on index labels. For any index pairs that don't match, the result will contain a missing value (`NaN`). The missing values will propagate in further arithmetic computations.


In [28]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=["a", "c", "e", "f", "g"])
s1 + s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

DataFrames perform alignment on both rows and columns when performing arithmetic operations. The result's index and columns will be the union of the indexes and columns from both DataFrames. Missing values (`NaN`) are introduced in locations where the labels don't overlap in both DataFrames.

In [29]:
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list("bcd"), index=["Ohio", "Texas", "Colorado"])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list("bde"), index=["Utah", "Ohio", "Texas", "Oregon"])

print(df1)
print("")
print(df2)
print("")
print(df1 + df2)

            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0

          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0

            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN


If two DataFrames have no column or row labels in common, the result will contain all nulls (`NaN`).
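
A minimal sketch of the all-null case, using two hypothetical frames whose row labels match but whose column labels do not:

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2]})
right = pd.DataFrame({"B": [3, 4]})

# Row labels (0, 1) overlap, but the column labels do not,
# so every cell in the union of columns is NaN.
total = left + right
print(total)
```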

##### Arithmetic methods with fill values

To handle missing values in arithmetic operations, you can use the `add` method with a `fill_value` parameter. When a value is missing from one of the two operands, `fill_value` is used in its place; positions that are missing from both operands remain `NaN`.

In [32]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list("abcd"))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list("abcde"))

df2.loc[1, "b"] = np.nan
print(df1)
print("")
print(df2)
print("")
print(df1 + df2)
print("")

# fill_value=0 stands in for a value that is missing from one operand before the addition.
print(df1.add(df2, fill_value=0))

     a    b     c     d
0  0.0  1.0   2.0   3.0
1  4.0  5.0   6.0   7.0
2  8.0  9.0  10.0  11.0

      a     b     c     d     e
0   0.0   1.0   2.0   3.0   4.0
1   5.0   NaN   7.0   8.0   9.0
2  10.0  11.0  12.0  13.0  14.0
3  15.0  16.0  17.0  18.0  19.0

      a     b     c     d   e
0   0.0   2.0   4.0   6.0 NaN
1   9.0   NaN  13.0  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN

      a     b     c     d     e
0   0.0   2.0   4.0   6.0   4.0
1   9.0   5.0  13.0  15.0   9.0
2  18.0  20.0  22.0  24.0  14.0
3  15.0  16.0  17.0  18.0  19.0


##### Operations between DataFrame and Series

Arithmetic operations between a DataFrame and a Series are similar to the broadcasting concept known from NumPy. 

In [33]:
# NumPy example
arr = np.arange(12.).reshape((3, 4))
print(arr)
print("")
print(arr[0])
print("")
print(arr - arr[0])

[[ 0.  1.  2.  3.]
 [ 4.  5.  6.  7.]
 [ 8.  9. 10. 11.]]

[0. 1. 2. 3.]

[[0. 0. 0. 0.]
 [4. 4. 4. 4.]
 [8. 8. 8. 8.]]


Subtracting the row from the array is performed element-wise for each row, a concept known as broadcasting. In the context of pandas, operations between a DataFrame and a Series work similarly:

In [34]:
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list("bde"), index=["Utah", "Ohio", "Texas", "Oregon"])
series = frame.iloc[0]
print(frame)
print("")
print(series)
print("")
print(frame - series)

          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

          b    d    e
Utah    0.0  0.0  0.0
Ohio    3.0  3.0  3.0
Texas   6.0  6.0  6.0
Oregon  9.0  9.0  9.0


By default, operations align the Series index with the DataFrame columns and perform broadcasting down the rows. This matches the Series's elements with the DataFrame's columns. If an index value is not found in either the DataFrame's columns or the Series's index, the objects will be reindexed to form the union:

In [37]:
series2 = pd.Series(range(3), index=["b", "e", "f"])
frame + series2

          b   d     e   f
Utah    0.0 NaN   3.0 NaN
Ohio    3.0 NaN   6.0 NaN
Texas   6.0 NaN   9.0 NaN
Oregon  9.0 NaN  12.0 NaN


In this case, any missing values are filled with `NaN`. If you want to broadcast over the DataFrame's columns and match on the rows, you can use one of the arithmetic methods and specify the axis. For example:

In [36]:
series3 = frame["d"]
frame.sub(series3, axis="index")

          b    d    e
Utah   -1.0  0.0  1.0
Ohio   -1.0  0.0  1.0
Texas  -1.0  0.0  1.0
Oregon -1.0  0.0  1.0


Specifying `axis="index"` means matching along the DataFrame's row index and broadcasting across the columns.

#### Function Application and Mapping

pandas allows you to apply NumPy universal functions (ufuncs) and other functions to its objects. Here are some common scenarios:

In [2]:
# Applying NumPy ufuncs to a DataFrame:

frame = pd.DataFrame(np.random.randn(4, 3), columns=list("bde"), index=["Utah", "Ohio", "Texas", "Oregon"])
print(frame)
print("")
print(np.abs(frame))

               b         d         e
Utah   -0.531631 -0.307995  0.265896
Ohio   -0.675501 -0.721530  0.406466
Texas   0.883562 -0.420230  0.220268
Oregon  1.295925 -0.432935  0.774463

               b         d         e
Utah    0.531631  0.307995  0.265896
Ohio    0.675501  0.721530  0.406466
Texas   0.883562  0.420230  0.220268
Oregon  1.295925  0.432935  0.774463


In this case, you apply the `np.abs` function element-wise to all elements in the DataFrame, which computes the absolute value of each element.

In [5]:
# Using the apply method to apply a function to columns:
def f1(x):
    return x.max() - x.min()

frame.apply(f1)


b    1.971426
d    0.413534
e    0.554195
dtype: float64

Here, the `apply` method applies the function `f1` to each column in the DataFrame. The result is a Series containing the differences between the maximum and minimum values in each column.

In [6]:
# Applying a function to rows using apply with axis="columns":
frame.apply(f1, axis="columns")

Utah      0.797527
Ohio      1.127996
Texas     1.303791
Oregon    1.728861
dtype: float64

By specifying `axis="columns"`, the `apply` method applies the function to each row, yielding a Series with the differences between the maximum and minimum values for each row.

In [7]:
# Applying a function that returns multiple values:
def f2(x):
    return pd.Series([x.min(), x.max()], index=["min", "max"])

frame.apply(f2)

            b         d         e
min -0.675501 -0.721530  0.220268
max  1.295925 -0.307995  0.774463


The `apply` method can return a Series with multiple values for each column. In this case, the function f2 computes both the minimum and maximum values for each column and returns them as a Series.

In [9]:
# Applying element-wise Python functions with applymap:
def my_format(x):
    return f"{x:.2f}"

frame.applymap(my_format)

            b      d     e
Utah    -0.53  -0.31  0.27
Ohio    -0.68  -0.72  0.41
Texas    0.88  -0.42  0.22
Oregon   1.30  -0.43  0.77


The `applymap` method applies the `my_format` function element-wise to each element in the DataFrame, formatting the floating-point values as strings. In pandas 2.1 and later, `applymap` is deprecated in favor of the equivalent `DataFrame.map`.

In [10]:
# Using the map method on a Series to apply element-wise functions:
frame["e"].map(my_format)

Utah      0.27
Ohio      0.41
Texas     0.22
Oregon    0.77
Name: e, dtype: object

The `map` method is available for Series objects, allowing you to apply element-wise functions to the Series's values. In this case, the `my_format` function is applied to each value in the "e" column.

##### Sorting and Ranking

pandas provides methods for sorting and ranking data within Series and DataFrames. To sort a Series by its index, you can use the `sort_index` method. By default, it sorts in ascending order.


In [11]:
obj = pd.Series(range(4), index=["d", "a", "b", "c"])
print(obj)
print("")
print(obj.sort_index())

d    0
a    1
b    2
c    3
dtype: int64

a    1
b    2
c    3
d    0
dtype: int64


You can sort a DataFrame by its index on either the rows or columns by using sort_index. You can specify the sorting axis using the axis parameter.

In [13]:
frame = pd.DataFrame(np.arange(8).reshape((2, 4)), index=["three", "one"], columns=["d", "a", "b", "c"])
print(frame)
print("")
print(frame.sort_index())
print("")
print(frame.sort_index(axis=1))
print("")
print(frame.sort_index(axis="columns"))
print("")
print(frame.sort_index(axis=1, ascending=False)) # Sorts in descending order

       d  a  b  c
three  0  1  2  3
one    4  5  6  7

       d  a  b  c
one    4  5  6  7
three  0  1  2  3

       a  b  c  d
three  1  2  3  0
one    5  6  7  4

       a  b  c  d
three  1  2  3  0
one    5  6  7  4

       d  c  b  a
three  0  3  2  1
one    4  7  6  5


To sort a Series by its values, you can use the sort_values method. Missing values are sorted to the end by default.

In [15]:
obj = pd.Series([4, 7, -3, 2])
print(obj)
print("")
print(obj.sort_values())
print("")
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

print(obj.sort_values(na_position="first")) # NaN values are placed first

0    4
1    7
2   -3
3    2
dtype: int64

2   -3
3    2
0    4
1    7
dtype: int64

1    NaN
3    NaN
4   -3.0
5    2.0
0    4.0
2    7.0
dtype: float64


You can sort a DataFrame by one or more columns using sort_values.

In [17]:
frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})
frame.sort_values("b")

   b  a
2 -3  0
3  2  1
0  4  0
1  7  1
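
Sorting by more than one column can be sketched by passing a list of names. The frame `df` repeats the data above and `by_two` is an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})

# Sort by "a" first, then break ties within each "a" group using "b".
by_two = df.sort_values(["a", "b"])
print(by_two)
```

The order of the list sets the priority of the sort keys.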


The `rank` method assigns ranks from 1 through the number of valid data points in an array, starting from the lowest value. By default, it breaks ties by assigning each group the mean rank.

In [19]:
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
print(obj)
print("")
print(obj.rank())
print("")
print(obj.rank(method="first")) # Ties are broken by the order in which they appear in the data
print("")
print(obj.rank(ascending=False, method="max")) # Assign tie values the maximum rank in the group

0    7
1   -5
2    7
3    4
4    2
5    0
6    4
dtype: int64

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64


##### Axis Indexes with Duplicate Labels

In pandas, index labels (row or column names) are not required to be unique. You can have duplicate labels within an index or column. Here's how pandas handles such situations:

1. Creating a Series with Duplicate Index Labels:

You can create a Series with duplicate index labels. For example:

In [20]:
obj = pd.Series(np.arange(5), index=["a", "a", "b", "b", "c"])
print(obj)
print("")
print(obj.index.is_unique) # Checks if index is unique

a    0
a    1
b    2
b    3
c    4
dtype: int64

False


2. Data Selection with Duplicate Labels:

When you index a label with multiple entries, it returns a Series. If you index a label with a single entry, it returns a scalar value:

In [21]:
print(obj["a"])
print(obj["c"])

a    0
a    1
dtype: int64
4


3. DataFrames with Duplicate Index Labels:

Duplicate labels can also exist in DataFrames. Here's an example:

In [22]:
df = pd.DataFrame(np.random.standard_normal((5, 3)), index=["a", "a", "b", "b", "c"])
print(df)
print("")
print(df.loc["b"]) # Selects all rows with index "b"
print("")
print(df.iloc[2]) # Selects the row at integer position 2
print("")
print(df.loc["b", 1]) # Selects column 1 for the rows labeled "b"

          0         1         2
a -0.178518 -1.247357  1.570519
a  0.158342  1.273858  0.790400
b  0.536472 -0.835922  0.513239
b  1.298136  0.180909 -0.913476
c -0.298435  0.689162 -0.196841

          0         1         2
b  0.536472 -0.835922  0.513239
b  1.298136  0.180909 -0.913476

0    0.536472
1   -0.835922
2    0.513239
Name: b, dtype: float64

b   -0.835922
b    0.180909
Name: 1, dtype: float64


#### Summarizing and Computing Descriptive Statistics