# <u>DataFrames.

DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. Let's use pandas to explore this topic!

# <u>Definition:

    We can think of a DataFrame as a bunch of Series objects put together to share the same index.

---

# <u>Imports.

In [5]:
import pandas as pd
import numpy as np

In [6]:
from numpy.random import randn

#### What is a seed?

    - A seed is a starting point for generating a sequence of pseudo-random numbers.
    - Computers don’t generate truly random numbers; they use algorithms that produce a sequence that only looks random, hence pseudo-random.
    - This sequence is entirely determined by the seed.
    - Run np.random.seed(value) in the same cell as np.random.rand(shape) or any other random-value generator. Otherwise, make sure np.random.seed(value) runs before the random numbers are generated.

In [8]:
np.random.seed(101)

In [9]:
# Using print() to display every value.

print(np.random.rand(3))
print(np.random.rand(3))
print(np.random.rand(3))
print(np.random.rand(3))
print(np.random.rand(3))

[0.51639863 0.57066759 0.02847423]
[0.17152166 0.68527698 0.83389686]
[0.30696622 0.89361308 0.72154386]
[0.18993895 0.55422759 0.35213195]
[0.1818924  0.78560176 0.96548322]


- <u>NOTE:

    - If we keep running the above two cells in order, i.e. np.random.seed(101) first and the print statements after it, we get the same random numbers every time.

In [11]:
# No reseed here: the generator continues its sequence, so the numbers differ from the cell above.

print(np.random.rand(3))
print(np.random.rand(3))
print(np.random.rand(3))
print(np.random.rand(3))
print(np.random.rand(3))

[0.23235366 0.08356143 0.60354842]
[0.72899276 0.27623883 0.68530633]
[0.51786747 0.04848454 0.13786924]
[0.18696743 0.9943179  0.5206654 ]
[0.57878954 0.73481906 0.54196177]


In [12]:
# OR we could run np.random.seed(101) in the same cells as the print statements.
# Preferred method.

np.random.seed(101)
print(np.random.rand(3))
print(np.random.rand(3))
print(np.random.rand(3))
print(np.random.rand(3))
print(np.random.rand(3))

[0.51639863 0.57066759 0.02847423]
[0.17152166 0.68527698 0.83389686]
[0.30696622 0.89361308 0.72154386]
[0.18993895 0.55422759 0.35213195]
[0.1818924  0.78560176 0.96548322]


In [13]:
np.random.seed(101)
print(np.random.rand(3))
print(np.random.rand(3))
print(np.random.rand(3))
print(np.random.rand(3))
print(np.random.rand(3))

[0.51639863 0.57066759 0.02847423]
[0.17152166 0.68527698 0.83389686]
[0.30696622 0.89361308 0.72154386]
[0.18993895 0.55422759 0.35213195]
[0.1818924  0.78560176 0.96548322]


In [14]:
# No reseed here: the generator continues its sequence, so the numbers differ from the cell above.

print(np.random.rand(3))
print(np.random.rand(3))
print(np.random.rand(3))
print(np.random.rand(3))
print(np.random.rand(3))

[0.23235366 0.08356143 0.60354842]
[0.72899276 0.27623883 0.68530633]
[0.51786747 0.04848454 0.13786924]
[0.18696743 0.9943179  0.5206654 ]
[0.57878954 0.73481906 0.54196177]


- <u>NOTE:

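    - Calling np.random.seed(101) again restarts the sequence from the beginning. Calling np.random.seed() with no argument reseeds the generator with a fresh, unpredictable value, so the numbers are no longer reproducible.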
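
The seeding behaviour shown in the cells above can be checked directly rather than by comparing printed numbers by eye. This is a minimal sketch; the variable names first_run, second_run and third_run are only illustrative.

```python
import numpy as np

# Seeding with the same value restarts the same pseudo-random sequence.
np.random.seed(101)
first_run = np.random.rand(3)

np.random.seed(101)
second_run = np.random.rand(3)

# Without reseeding, the generator simply continues the sequence.
third_run = np.random.rand(3)

print(np.array_equal(first_run, second_run))  # same seed, same numbers
print(np.array_equal(second_run, third_run))  # continued sequence, different numbers
```

Keeping the seed call in the same cell as the random draws, as recommended earlier, makes the first comparison hold no matter how many times the cell is re-run.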

---

# <u>Creating a DataFrame object.

#### DataFrame:

    Think of it as a spreadsheet or an SQL table.
    It's a data object — specifically, a 2-D labeled data structure provided by pandas.
    We can think of a DataFrame as a bunch of Series objects put together to share the same index.
    It’s made up of rows and columns.

In [19]:
# No need to write np.random.randn, since we already ran: from numpy.random import randn

df = pd.DataFrame(data = randn(5, 4), index = ['A', 'B', 'C', 'D', 'E'], columns = ['W', 'X', 'Y', 'Z'])

- <u>NOTE:

    - We have a data argument and an index argument, just like we did for a Series, plus an additional columns argument.

In [21]:
# We could also build the label lists with str.split()
# df = pd.DataFrame(randn(5, 4), index = 'A B C D E'.split(), columns = 'W X Y Z'.split())

In [22]:
df

          W         X         Y         Z
A -1.706086 -1.159119 -0.134841  0.390528
B  0.166905  0.184502  0.807706  0.072960
C  0.638787  0.329646 -0.497104 -0.754070
D -0.943406  0.484752 -0.116773  1.901755
E  0.238127  1.996652 -0.993263  0.196800


- <u>NOTE:

    - So basically what we have here is a collection of columns.
    - Each of these columns is actually just a pandas Series, and they all share a common index. (Ex: W is a column that shares the same index (A, B, C, D, E) with the other columns (X, Y, Z).)
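
To make the "Series sharing one index" idea concrete, here is a small sketch that builds a DataFrame from two Series. The names index, w, x and small_df and the values are made up for illustration; they are not part of the notebook above.

```python
import pandas as pd

index = ['A', 'B', 'C']

# Two Series built on the same index (values are made up for illustration).
w = pd.Series([1.0, 2.0, 3.0], index=index)
x = pd.Series([10.0, 20.0, 30.0], index=index)

# A dict of Series becomes a DataFrame; the dict keys become column names.
small_df = pd.DataFrame({'W': w, 'X': x})

# Each column comes back as a Series that carries the DataFrame's index.
print(small_df['W'].index.equals(small_df.index))
print(small_df['X'].index.equals(small_df.index))
```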

---

# <u>Selection and Indexing.

Let's learn the various methods to grab data from a DataFrame:

1. <u>**To get a single column back:**

In [27]:
# Passing in the column name.

df['W']

A   -1.706086
B    0.166905
C    0.638787
D   -0.943406
E    0.238127
Name: W, dtype: float64

- <u>NOTE:

    - When I ask for a single column, I'm actually getting back a Series.

In [29]:
type(df['W'])

pandas.core.series.Series

In [30]:
type(df)

pandas.core.frame.DataFrame

- <u>NOTE:

    - Checking the types confirms that a DataFrame is a bunch of Series objects put together to share the same index.

In [32]:
# SQL-style dot syntax (NOT RECOMMENDED!) (Table.column_name)

df.W

A   -1.706086
B    0.166905
C    0.638787
D   -0.943406
E    0.238127
Name: W, dtype: float64

- <u>NOTE:

    - However, we shouldn't use this, because column names can clash with the many methods and attributes available on a DataFrame.
    - If a column shares its name with one of these methods (for example count or mean), attribute access returns the method rather than the column, so only the bracket syntax reliably gives us the column.
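
The name clash is easy to reproduce. The sketch below uses a hypothetical frame called sales with a column named count, which is also the name of a DataFrame method; neither the frame nor its values come from the notebook.

```python
import pandas as pd

# Hypothetical frame with a column named 'count', which is also a DataFrame method.
sales = pd.DataFrame({'count': [3, 5, 2], 'price': [1.5, 2.0, 0.5]})

# Bracket access always returns the column.
print(type(sales['count']))

# Attribute access finds the method first, not the column.
print(callable(sales.count))

# A column name that does not clash still works with attribute access.
print(sales.price.equals(sales['price']))
```

Because the result of dot access depends on which names pandas happens to define, the bracket form is the safer habit.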

2. <u>**To get multiple columns back:**

In [35]:
# Passing in the columns as a list.

df[['W', 'Z']]

          W         Z
A -1.706086  0.390528
B  0.166905  0.072960
C  0.638787 -0.754070
D -0.943406  1.901755
E  0.238127  0.196800


- <u>NOTE:

    - When I ask for multiple columns I'm actually getting back a DataFrame.
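
The return type depends on whether the brackets receive a string or a list, not on how many columns end up selected. A small sketch with a made-up frame (the names frame, as_series and as_frame are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({'W': [1, 2], 'Z': [3, 4]}, index=['A', 'B'])

as_series = frame['W']      # a string selects one column as a Series
as_frame = frame[['W']]     # a list selects columns as a DataFrame, even a list of one

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(as_frame.shape)
```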

3. <u>**Creating a new column:**

In [38]:
# The column doesn't exist yet, so this raises KeyError: 'new'
# df['new']

In [39]:
# NOTE: It's just like in an Excel sheet.
# Adding two columns to create a new one.

df['new'] = df['W'] + df['Y']

In [40]:
df

          W         X         Y         Z       new
A -1.706086 -1.159119 -0.134841  0.390528 -1.840927
B  0.166905  0.184502  0.807706  0.072960  0.974611
C  0.638787  0.329646 -0.497104 -0.754070  0.141683
D -0.943406  0.484752 -0.116773  1.901755 -1.060180
E  0.238127  1.996652 -0.993263  0.196800 -0.755137


4. <u>**Removing Columns:**

In [42]:
# KeyError: "['new'] not found in axis"
# drop() defaults to axis = 0, so it looks for a row labelled 'new'.

# df.drop('new')

- <u>Syntax: df.drop(labels, axis = 0, inplace = False)

    - axis = 0 refers to index / rows.
    - axis = 1 refers to columns.

In [44]:
df.drop('new', axis = 1)

          W         X         Y         Z
A -1.706086 -1.159119 -0.134841  0.390528
B  0.166905  0.184502  0.807706  0.072960
C  0.638787  0.329646 -0.497104 -0.754070
D -0.943406  0.484752 -0.116773  1.901755
E  0.238127  1.996652 -0.993263  0.196800


In [45]:
# Does not affect the DataFrame object unless we use the inplace argument.

df

          W         X         Y         Z       new
A -1.706086 -1.159119 -0.134841  0.390528 -1.840927
B  0.166905  0.184502  0.807706  0.072960  0.974611
C  0.638787  0.329646 -0.497104 -0.754070  0.141683
D -0.943406  0.484752 -0.116773  1.901755 -1.060180
E  0.238127  1.996652 -0.993263  0.196800 -0.755137


- <u>NOTE:

    - Many pandas methods return a new object and leave the original untouched unless inplace is set to True. pandas does this so that we do not accidentally lose information.
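
As an alternative to inplace, the result of drop() can be assigned back to a name. The sketch below uses a small made-up frame; frame, trimmed and trimmed_kw are illustrative names, and the columns keyword is simply another spelling of axis = 1.

```python
import pandas as pd

frame = pd.DataFrame({'W': [1, 2], 'X': [3, 4], 'new': [4, 6]}, index=['A', 'B'])

# drop() returns a new DataFrame; the original keeps its 'new' column.
trimmed = frame.drop('new', axis=1)
print(list(frame.columns))
print(list(trimmed.columns))

# Equivalent spelling using the columns keyword instead of axis=1.
trimmed_kw = frame.drop(columns='new')
print(trimmed.equals(trimmed_kw))

# Reassigning the name has the same visible effect as inplace=True.
frame = frame.drop('new', axis=1)
print(list(frame.columns))
```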

In [47]:
df.drop('new', axis = 1, inplace = True)

In [48]:
df

          W         X         Y         Z
A -1.706086 -1.159119 -0.134841  0.390528
B  0.166905  0.184502  0.807706  0.072960
C  0.638787  0.329646 -0.497104 -0.754070
D -0.943406  0.484752 -0.116773  1.901755
E  0.238127  1.996652 -0.993263  0.196800


- We can also drop rows this way:

In [50]:
# inplace = False is the default, so df itself is unchanged.

df.drop('E', axis = 0, inplace = False)
# OR df.drop('E')

          W         X         Y         Z
A -1.706086 -1.159119 -0.134841  0.390528
B  0.166905  0.184502  0.807706  0.072960
C  0.638787  0.329646 -0.497104 -0.754070
D -0.943406  0.484752 -0.116773  1.901755


In [51]:
df

          W         X         Y         Z
A -1.706086 -1.159119 -0.134841  0.390528
B  0.166905  0.184502  0.807706  0.072960
C  0.638787  0.329646 -0.497104 -0.754070
D -0.943406  0.484752 -0.116773  1.901755
E  0.238127  1.996652 -0.993263  0.196800


- <u>NOTE:

    - Why are the rows axis 0 and the columns axis 1?
    - Ans: The convention comes from NumPy, since a DataFrame is essentially a NumPy array with index and column labels on top.
    - To show this:

In [53]:
# 5 rows and 4 columns.
# 2-D array (Matrix).

df.shape

(5, 4)

- <u>NOTE:

    - This is why rows are referred to as axis 0 and columns as axis 1: the numbering is taken directly from the position in shape, just as in a NumPy array.
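
The link between shape and axis numbers can be seen by wrapping a plain NumPy array. This is a sketch with made-up integer data; arr and frame are illustrative names.

```python
import numpy as np
import pandas as pd

arr = np.arange(20).reshape(5, 4)  # 5 rows, 4 columns
frame = pd.DataFrame(arr, index=list('ABCDE'), columns=list('WXYZ'))

# The DataFrame reports the same shape as the array it wraps.
print(arr.shape, frame.shape)

# axis=0 runs along the rows (first entry of shape), giving one result per column.
print(frame.sum(axis=0).shape)

# axis=1 runs along the columns (second entry of shape), giving one result per row.
print(frame.sum(axis=1).shape)
```

For drop(), the axis names where the label lives: axis = 0 looks in the row index and axis = 1 looks in the column labels.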

5. <u>**To get a single row back:**

        There are two indexers we can choose from:
            1. loc:
                loc stands for location and selects rows by index label.
                Syntax: df.loc[index_label]
                Notice that loc uses [] instead of (), because it is an indexer rather than a method.
            2. iloc:
                Selects rows by integer position (0-based), regardless of the index labels.
                Syntax: df.iloc[position]
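
The difference between label and position is clearest when the index labels are themselves integers. The frame below is hypothetical (labels 10, 20, 30 chosen only to differ from the positions 0, 1, 2).

```python
import pandas as pd

# Hypothetical frame whose index labels are integers that do not match positions.
frame = pd.DataFrame({'W': [1.0, 2.0, 3.0]}, index=[10, 20, 30])

by_label = frame.loc[10]      # row whose index label is 10
by_position = frame.iloc[0]   # first row, whatever its label

print(by_label.equals(by_position))

# Position 10 does not exist in a 3-row frame, so iloc raises IndexError.
try:
    frame.iloc[10]
except IndexError as err:
    print('IndexError:', err)
```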

In [56]:
df

          W         X         Y         Z
A -1.706086 -1.159119 -0.134841  0.390528
B  0.166905  0.184502  0.807706  0.072960
C  0.638787  0.329646 -0.497104 -0.754070
D -0.943406  0.484752 -0.116773  1.901755
E  0.238127  1.996652 -0.993263  0.196800


In [57]:
# Returns a Series.

df.loc['A']

W   -1.706086
X   -1.159119
Y   -0.134841
Z    0.390528
Name: A, dtype: float64

In [58]:
df.loc['E']

W    0.238127
X    1.996652
Y   -0.993263
Z    0.196800
Name: E, dtype: float64

In [59]:
df.iloc[4]

W    0.238127
X    1.996652
Y   -0.993263
Z    0.196800
Name: E, dtype: float64

6. <u>**To get multiple rows back:**

In [61]:
df.loc[['A', 'B']]

          W         X         Y         Z
A -1.706086 -1.159119 -0.134841  0.390528
B  0.166905  0.184502  0.807706  0.072960


In [62]:
df.iloc[[0, 1]]

          W         X         Y         Z
A -1.706086 -1.159119 -0.134841  0.390528
B  0.166905  0.184502  0.807706  0.072960


7. <u>**Selecting a subset of rows and columns:**

    - Syntax:

            df.loc[row_label, column_label]        ---> (row, column) order, the same as shape
            df.iloc[row_position, column_position]

In [64]:
df

          W         X         Y         Z
A -1.706086 -1.159119 -0.134841  0.390528
B  0.166905  0.184502  0.807706  0.072960
C  0.638787  0.329646 -0.497104 -0.754070
D -0.943406  0.484752 -0.116773  1.901755
E  0.238127  1.996652 -0.993263  0.196800


In [65]:
# Returns a single value.

df.loc['B', 'Y']

0.8077059142577141

In [66]:
df.iloc[1, 2]

0.8077059142577141

Get the subset of 'A' and 'B' rows with the 'W' and 'Y' columns:

In [68]:
df.loc[['A', 'B'], ['W', 'Y']]

          W         Y
A -1.706086 -0.134841
B  0.166905  0.807706


In [69]:
df.iloc[[0, 1], [0, 2]]

          W         Y
A -1.706086 -0.134841
B  0.166905  0.807706
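
The loc and iloc calls above select the same cells because the labels 'A', 'B' sit at positions 0, 1 and the columns 'W', 'Y' sit at positions 0, 2. One possible way to do that translation in code is Index.get_indexer, sketched here on a small made-up frame (frame, rows, cols, row_pos and col_pos are illustrative names).

```python
import pandas as pd

frame = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]],
    index=['A', 'B', 'C'],
    columns=['W', 'X', 'Y', 'Z'],
)

rows, cols = ['A', 'B'], ['W', 'Y']

# Translate labels to integer positions.
row_pos = frame.index.get_indexer(rows)
col_pos = frame.columns.get_indexer(cols)
print(row_pos, col_pos)

# Label-based and position-based selection pick the same cells.
print(frame.loc[rows, cols].equals(frame.iloc[row_pos, col_pos]))
```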


---