# Pandas DataFrames

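A DataFrame is a 2-dimensional, labeled data structure (rows and columns)

It can be thought of as a combination of Series, one per column, that all share the same index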

### How to create a Data Frame

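- `data` - a list, dictionary, or NumPy array holding the values
- `index` - optional row labels (defaults to 0, 1, 2, ...)
- `columns` - optional column labels (defaults to 0, 1, 2, ... unless `data` is a dictionary, whose keys are used)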

`pd.DataFrame(data, index, columns)`
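As a minimal sketch of the three kinds of input listed above (the row labels and column names here are made up for illustration), the same constructor can be called like this:

```python
import numpy as np
import pandas as pd

rows = ["r1", "r2", "r3"]  # hypothetical row labels

# From a list: a single column, named through `columns`
from_list = pd.DataFrame([1, 2, 3], index=rows, columns=["a"])

# From a dictionary: the keys become the column names
from_dict = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]}, index=rows)

# From a 2-D NumPy array: its shape must be len(index) x len(columns)
from_array = pd.DataFrame(np.arange(6).reshape(3, 2), index=rows, columns=["a", "b"])

print(from_list.shape)   # (3, 1)
print(from_dict.shape)   # (3, 2)
print(from_array.shape)  # (3, 2)
```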

### Columns

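Indexing a DataFrame with `[]` returns columns by default

With `[]`, columns have to be accessed by their names (labels), i.e. `df[0]` does not mean "the first column"; positional access goes through `iloc`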

###### Single Column

`data_frame_object['name_of_column']`

###### Multiple Columns

This can be confusing because we have to pass in a list of names, hence the double brackets

`data_frame_object[['name_of_column_1', 'name_of_column_2']]`
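The difference between the two forms shows up in the type that comes back. A small sketch, using a made-up three-column frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})  # hypothetical data

single = df["a"]            # one name -> a Series
multiple = df[["c", "a"]]   # a list of names -> a DataFrame, in the requested order
one_col_frame = df[["a"]]   # a one-item list still gives a DataFrame

print(type(single).__name__)         # Series
print(type(multiple).__name__)       # DataFrame
print(type(one_col_frame).__name__)  # DataFrame
```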

###### Adding a new column / Assignment

The new data must be a single column with exactly as many values as the DataFrame has rows (a single scalar is also accepted and fills every row)

`df['new_column_name'] = dataset`
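A short sketch of the size rule, on a made-up three-row frame; the column names are only for illustration:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})  # hypothetical data with 3 rows

df["b"] = [10, 20, 30]  # same length as the frame: accepted
df["c"] = 0             # a scalar is repeated for every row

try:
    df["d"] = [1, 2]    # only 2 values for 3 rows
except ValueError as err:
    print("rejected:", err)

print(list(df.columns))  # ['a', 'b', 'c']
```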

###### Drop a Column

This can be confusing because there are some caveats

First and foremost, if we want to drop a column, we have to specify 2 things
- The column name
- The axis, which is always 1 for columns (0, the default, means rows)

Next, `drop` returns a new DataFrame without the column; it does not change the original DataFrame

`df.drop('name_of_column', axis=1)`

To keep the result, re-assign it to the same variable

`df = df.drop('name_of_column', axis=1)`

Or drop in place (not recommended: `inplace=True` mutates the DataFrame, returns `None`, and is widely discouraged in favor of re-assignment)

`df.drop('name_of_column', axis=1, inplace=True)`
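To make the "original is untouched" behavior concrete, here is a minimal sketch on a made-up two-column frame. The `columns=` keyword is an alternative spelling of `axis=1` that `drop` also accepts:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})  # hypothetical data

trimmed = df.drop("b", axis=1)       # new DataFrame without "b"
also_trimmed = df.drop(columns="b")  # same result, without the axis argument

print(list(df.columns))       # ['a', 'b']  (original unchanged)
print(list(trimmed.columns))  # ['a']
```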


### Rows

There are a couple of ways to access rows

For anything dealing with rows, think of `loc` or `iloc`

- `loc` -> location (label based)
- `iloc` -> integer location (position based)

###### loc

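`loc` selects by the NAME (label) of the row: a single cell if a column is also given, or the entire row if the column is not specified

Takes a single label, a list of labels, or a label slice for both the rows and the columns; passing lists, as here, always returns a DataFrame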

`df.loc[['row_name'], ['column_name']]`
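What `loc` returns depends on how the labels are passed. A small sketch with made-up labels:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=["x", "y", "z"])  # hypothetical data

row = df.loc["x"]                  # single row label -> Series holding the whole row
cell = df.loc["x", "b"]            # row label + column label -> a single value
block = df.loc[["x", "z"], ["b"]]  # lists of labels -> DataFrame
sliced = df.loc["x":"y"]           # label slices include BOTH endpoints

print(cell)                # 4
print(block.shape)         # (2, 1)
print(list(sliced.index))  # ['x', 'y']
```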

###### iloc

Works similarly to `loc`, however it takes integer positions (starting at 0) instead of names

`df.iloc[row_number, column_number]`

Row selection can be tricky: if we remove a row, its label is gone and `loc` raises a `KeyError` for it, while `iloc` still works because positions are simply renumbered, so `iloc[0]` now returns whichever row has become the first one
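The label versus position difference after a row is removed can be sketched as follows (labels and values are made up):

```python
import pandas as pd

df = pd.DataFrame({"a": [10, 20, 30]}, index=["x", "y", "z"])  # hypothetical data
smaller = df.drop("x")  # row "x" is gone from the new frame

try:
    smaller.loc["x"]    # the label no longer exists
except KeyError:
    print("loc: label 'x' not found")

first = smaller.iloc[0]  # position 0 still exists; it is now the row labelled "y"
print(first.name)        # y
```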


###### Assigning values

`df.loc['new_row_id'] = value`
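Assignment through `loc` either adds or overwrites, depending on whether the label already exists. A minimal sketch with made-up labels:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])  # hypothetical data

df.loc["z"] = [5, 6]   # new label: a row is appended
df.loc["x"] = 0        # existing label: the whole row is overwritten
df.loc["y", "b"] = 40  # row + column label: a single cell is overwritten

print(df.shape)  # (3, 2)
```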

###### Drop Row

Drops the row whose index label is 0 (`drop` works on labels, not positions)

`df = df.drop(0)`
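Because `drop` matches labels, the row that disappears is not necessarily the first one. A sketch using a deliberately shuffled, made-up integer index:

```python
import pandas as pd

df = pd.DataFrame({"a": [10, 20, 30]}, index=[2, 0, 1])  # hypothetical data

by_label = df.drop(0)               # removes the row LABELLED 0 (the middle one here)
by_position = df.drop(df.index[0])  # look up the label at position 0 first, then drop it

print(by_label["a"].tolist())     # [10, 30]
print(by_position["a"].tolist())  # [20, 30]
```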


### Other Functions

###### Comparison Operators

Comparison operators work either on the entire data set or on selected cells

`df > x` or `df.loc['row_name'] > x` or `df.iloc[row_number] > x`

The result is a boolean mask, which can be used to select the values to fill in or clean up

`df_boolean = df > x`

`df[df_boolean]`
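As a rough sketch of the fill-in and clean-up uses mentioned above (the data and the threshold 3 are made up), a mask can be combined with `fillna` or `dropna`:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 5, 6], "b": [7, 2, 8]})  # hypothetical data

mask = df > 3      # DataFrame of True/False with the same shape as df
masked = df[mask]  # values that fail the test become NaN

filled = masked.fillna(0)  # fill in: replace the NaN values
cleaned = masked.dropna()  # clean up: keep only rows with no NaN

print(filled["a"].tolist())  # [0.0, 5.0, 6.0]
print(list(cleaned.index))   # [2]
```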

###### Filtering

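`df[df['column_name'] > x]`

Stacking filters uses the & (and; ampersand) or | (or; pipe) operator, with each condition wrapped in its own parentheses; & binds tighter than |, so add parentheses to make the grouping explicit

`df[((df['column_name'] > x) & (df['column_name'] < y)) | (df['column_name'] == z)]`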
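A small sketch of stacked row filters on made-up data. Filtering on a column keeps or drops whole rows, unlike masking the entire frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 5, 9], "b": [2, 4, 6]})  # hypothetical data

in_range = df[(df["a"] > 2) & (df["a"] < 8)]  # both conditions must hold
either = df[(df["a"] < 2) | (df["b"] > 5)]    # at least one condition must hold

# Explicit grouping: (a in range) OR (b equals 2)
combined = df[((df["a"] > 2) & (df["a"] < 8)) | (df["b"] == 2)]

print(list(in_range.index))  # [1]
print(list(either.index))    # [0, 2]
print(list(combined.index))  # [0, 1]
```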

###### Resetting the Index

Indexes apply to rows unless otherwise specified

Creates a new index starting from 0 and moves the original index into the data as a regular column (pass `drop=True` to discard it instead); it returns a new DataFrame and does NOT change the original

`df.reset_index()`

To keep the change

`df = df.reset_index()`
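The two variants of `reset_index` can be compared side by side. A minimal sketch with a made-up labelled index:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]}, index=["x", "y"])  # hypothetical data

kept = df.reset_index()                # old index becomes a column named "index"
discarded = df.reset_index(drop=True)  # old index is thrown away

print(list(kept.columns))       # ['index', 'a']
print(list(discarded.columns))  # ['a']
print(list(df.index))           # ['x', 'y']  (original unchanged)
```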

###### Setting the column as the Index

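We may want to have a column be the index of the dataframe

Replaces the original index (the old index is discarded) and removes the column from the regular columns

Returns a new DataFrame, leaving the original unchanged

`df.set_index('column_name')`

To keep the change

`df = df.set_index('column_name')`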
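A short sketch of what happens to the old index and to the chosen column. The `id` column is a made-up name used only for illustration:

```python
import pandas as pd

df = pd.DataFrame({"id": [10, 20], "a": [1, 2]}, index=["x", "y"])  # hypothetical data

indexed = df.set_index("id")      # "id" becomes the index; labels "x", "y" are discarded
restored = indexed.reset_index()  # "id" comes back as a column, but "x", "y" do not

print(list(indexed.index))     # [10, 20]
print(list(indexed.columns))   # ['a']
print(list(restored.columns))  # ['id', 'a']
```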

In [310]:
import pandas as pd
import numpy as np

In [320]:
# Create a DataFrame


# List
# list1 = [1,2,3,4]

# pd.Series(list1)
# pd.DataFrame(list1, columns=['Col1'], index=['Row1', 'Row2', 'Row3', 'Row4'])


# Dictionary
# data = {'Col1':[1,2,3,4], 'Col2':[10,20,30,40]}
# pd.DataFrame(data, index=['Row1', 'Row2', 'Row3', 'Row4'])

# Seeding the random generator makes np.random return the same values on every run

np.random.seed(1) # Handy as a learning tool to follow along, but don't fix the seed like this in actual development
df = pd.DataFrame(np.random.rand(4,4), index=['Row1','Row2','Row3', 'Row4'], columns=['Col1', 'Col2', 'Col3','Col4'])
df



          Col1      Col2      Col3      Col4
Row1  0.417022  0.720324  0.000114  0.302333
Row2  0.146756  0.092339  0.186260  0.345561
Row3  0.396767  0.538817  0.419195  0.685220
Row4  0.204452  0.878117  0.027388  0.670468


In [328]:
# Accessing data from the data frame

# By default, indexing with [] selects columns, and it needs the name of the column
# df[0] # this will not work (raises a KeyError, there is no column named 0)

# Single Column
# df['Col1']

# Multiple Columns
df[['Col2','Col4', 'Col1']]




          Col2      Col4      Col1
Row1  0.720324  0.302333  0.417022
Row2  0.092339  0.345561  0.146756
Row3  0.538817  0.685220  0.396767
Row4  0.878117  0.670468  0.204452


In [344]:
# Adding a new column

# df['New Col'] = np.random.rand(5) # This raises a ValueError because the length (5) does not match the number of rows (4)
df['New Col'] = np.random.rand(4) # This is fine because the length of the random array matches the number of rows

In [345]:
df

          Col1      Col2      Col3      Col4   New Col
Row1  0.417022  0.720324  0.000114  0.302333  0.691877
Row2  0.146756  0.092339  0.186260  0.345561  0.315516
Row3  0.396767  0.538817  0.419195  0.685220  0.686501
Row4  0.204452  0.878117  0.027388  0.670468  0.834626


In [340]:
# Dropping data from the dataFrame

# Axis 0 -> refers to the rows
# Axis 1 -> refers to the columns

# If we want to drop a column, we have to specify axis=1

# But this command only returns a new dataframe without the column, it does not change the actual dataframe

df.drop('New Col', axis=1)

          Col1      Col2      Col3      Col4
Row1  0.417022  0.720324  0.000114  0.302333
Row2  0.146756  0.092339  0.186260  0.345561
Row3  0.396767  0.538817  0.419195  0.685220
Row4  0.204452  0.878117  0.027388  0.670468


In [342]:
df = df.drop('New Col', axis=1)

In [346]:
df # 'New Col' is still here: the cells ran out of order (In [344], which adds the column, ran after In [342])

          Col1      Col2      Col3      Col4   New Col
Row1  0.417022  0.720324  0.000114  0.302333  0.691877
Row2  0.146756  0.092339  0.186260  0.345561  0.315516
Row3  0.396767  0.538817  0.419195  0.685220  0.686501
Row4  0.204452  0.878117  0.027388  0.670468  0.834626


In [347]:
# Drop in place - avoid this method
# You will see it in documentation and you will see others using it

# inplace=True mutates df directly and returns None, so prefer df = df.drop(...)
df.drop('New Col', axis=1, inplace=True)


In [348]:
df

          Col1      Col2      Col3      Col4
Row1  0.417022  0.720324  0.000114  0.302333
Row2  0.146756  0.092339  0.186260  0.345561
Row3  0.396767  0.538817  0.419195  0.685220
Row4  0.204452  0.878117  0.027388  0.670468


In [359]:
# Rows - Accessing Data in the row

# loc uses the names of the rows and columns

# Single Row
# df.loc['Row1']

# Multiple rows and multiple columns
df.loc[['Row1', 'Row3'], ['Col1','Col3']]

          Col1      Col3
Row1  0.417022  0.000114
Row3  0.396767  0.419195


In [367]:
# iloc uses the integer positions of the rows and columns

# The syntax is a little different: plain integers or slices, and slices exclude the end position

# iloc[row_number, col_number]
df.iloc[1:4, 0:2]

          Col1      Col2
Row2  0.146756  0.092339
Row3  0.396767  0.538817
Row4  0.204452  0.878117


In [370]:
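# Rows - Assigning Data

print(df)

# Assigning a string into a float column forces the column to the object dtype (recent pandas versions warn about this)
df.loc['Row1', 'Col1'] = 'Changed'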

          Col1      Col2      Col3      Col4
Row1  0.417022  0.720324  0.000114  0.302333
Row2  0.146756  0.092339  0.186260  0.345561
Row3  0.396767  0.538817  0.419195  0.685220
Row4  0.204452  0.878117  0.027388  0.670468


In [372]:
df.loc['Row1', 'Col1'] = 0.417022

In [373]:
df

          Col1      Col2      Col3      Col4
Row1  0.417022  0.720324  0.000114  0.302333
Row2  0.146756  0.092339  0.186260  0.345561
Row3  0.396767  0.538817  0.419195  0.685220
Row4  0.204452  0.878117  0.027388  0.670468


In [377]:
# Rows - Drop Data

# Exact same idea as columns, except axis=0 is the default so it can be left out

# By default it returns a new dataframe; if you want to change the actual dataframe, save it as df again

df.drop('Row1')

          Col1      Col2      Col3      Col4
Row2  0.146756  0.092339  0.186260  0.345561
Row3  0.396767  0.538817  0.419195  0.685220
Row4  0.204452  0.878117  0.027388  0.670468


In [375]:
df

          Col1      Col2      Col3      Col4
Row1  0.417022  0.720324  0.000114  0.302333
Row2  0.146756  0.092339  0.186260  0.345561
Row3  0.396767  0.538817  0.419195  0.685220
Row4  0.204452  0.878117  0.027388  0.670468


In [381]:
# Filtering

df > 0.1 # Returns a table of True/False values


      Col1   Col2   Col3  Col4
Row1  True   True  False  True
Row2  True  False   True  True
Row3  True   True   True  True
Row4  True   True  False  True


In [382]:
# This is the actual filter that will return the table of values

df[df > 0.1] # This returns a table with the original values where True and NaN where False



          Col1      Col2      Col3      Col4
Row1  0.417022  0.720324       NaN  0.302333
Row2  0.146756       NaN  0.186260  0.345561
Row3  0.396767  0.538817  0.419195  0.685220
Row4  0.204452  0.878117       NaN  0.670468


In [384]:
# Drop all rows that contain a null

df[df > 0.1].dropna() # Drops every ROW that has at least one NaN

          Col1      Col2      Col3      Col4
Row3  0.396767  0.538817  0.419195  0.685220


In [388]:
# Stacked conditions are combined with & and |, each condition in its own parentheses

df[(df > 0.1) & (df < 0.6)]

          Col1      Col2      Col3      Col4
Row1  0.417022       NaN       NaN  0.302333
Row2  0.146756       NaN  0.186260  0.345561
Row3  0.396767  0.538817  0.419195       NaN
Row4  0.204452       NaN       NaN       NaN


In [389]:
# Setting and Resetting the index

df['New index'] = [10,20,30,40]

In [390]:
df

          Col1      Col2      Col3      Col4  New index
Row1  0.417022  0.720324  0.000114  0.302333         10
Row2  0.146756  0.092339  0.186260  0.345561         20
Row3  0.396767  0.538817  0.419195  0.685220         30
Row4  0.204452  0.878117  0.027388  0.670468         40


In [394]:
# Reset Index

# Does not modify the original; drop=True discards the old index instead of keeping it as a column

df.reset_index(drop=True)

       Col1      Col2      Col3      Col4  New index
0  0.417022  0.720324  0.000114  0.302333         10
1  0.146756  0.092339  0.186260  0.345561         20
2  0.396767  0.538817  0.419195  0.685220         30
3  0.204452  0.878117  0.027388  0.670468         40


In [395]:
df

          Col1      Col2      Col3      Col4  New index
Row1  0.417022  0.720324  0.000114  0.302333         10
Row2  0.146756  0.092339  0.186260  0.345561         20
Row3  0.396767  0.538817  0.419195  0.685220         30
Row4  0.204452  0.878117  0.027388  0.670468         40


In [403]:
# Set the index

# Returns a new dataframe; drop=True (the default) removes 'New index' from the regular columns
df.set_index('New index', drop=True)

               Col1      Col2      Col3      Col4
New index
10         0.417022  0.720324  0.000114  0.302333
20         0.146756  0.092339  0.186260  0.345561
30         0.396767  0.538817  0.419195  0.685220
40         0.204452  0.878117  0.027388  0.670468


In [398]:
df

          Col1      Col2      Col3      Col4  New index
Row1  0.417022  0.720324  0.000114  0.302333         10
Row2  0.146756  0.092339  0.186260  0.345561         20
Row3  0.396767  0.538817  0.419195  0.685220         30
Row4  0.204452  0.878117  0.027388  0.670468         40
