# The `DataFrame`

The DataFrame data structure is the heart of the pandas library. It is the primary object you will work with in data analysis and cleaning tasks.

The DataFrame is conceptually a **two-dimensional Series object**, where there's an **`index`** and **`multiple columns of content`**, with **each column having a label**. In fact, the distinction between a column and a row is really only conceptual.

And you can think of the DataFrame itself as simply a **two-axis labeled array.**

In [1]:
import pandas as pd

In [2]:
# I'll create each record as a Series holding a student name, the class name, and the score.
record1 = pd.Series({'Name': 'Alice',
                        'Class': 'Physics',
                        'Score': 85})
record2 = pd.Series({'Name': 'Jack',
                        'Class': 'Chemistry',
                        'Score': 82})
record3 = pd.Series({'Name': 'Helen',
                        'Class': 'Biology',
                        'Score': 90})

In [3]:
# Like a Series, the DataFrame object is indexed. Here I'll use a group of Series, where each Series
# represents a row of data. Just like the Series function, we can pass in our individual items
# in a list, and we can pass in our index values as a second argument

df = pd.DataFrame([record1, record2, record3],
                  index=['school1', 'school2', 'school1'])

# And just like the Series we can use the head() function to see the first several rows of the
# dataframe, including indices from both axes, and we can use this to verify the columns and the rows
df.head()

          Name      Class  Score
school1  Alice    Physics     85
school2   Jack  Chemistry     82
school1  Helen    Biology     90


In [5]:
# An alternative method is that you could use a list of dictionaries, where each dictionary 
# represents a row of data.

students = [{'Name': 'Alice',
              'Class': 'Physics',
              'Score': 85},
            {'Name': 'Jack',
             'Class': 'Chemistry',
             'Score': 82},
            {'Name': 'Helen',
             'Class': 'Biology',
             'Score': 90}]

# Then we pass this list of dictionaries into the DataFrame function
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])
# And let's print the head again
df.head()

          Name      Class  Score
school1  Alice    Physics     85
school2   Jack  Chemistry     82
school1  Helen    Biology     90
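
The "two-axis labeled array" idea can be checked directly on the frame we just built. The sketch below rebuilds the same three records so that it runs on its own, then reads the labels of each axis through the `index`, `columns`, and `shape` attributes. The variable names are only for illustration.

```python
import pandas as pd

# Rebuild the small example frame so this sketch runs on its own.
students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

row_labels = list(df.index)       # labels along axis 0 (rows)
column_labels = list(df.columns)  # labels along axis 1 (columns)
shape = df.shape                  # (number of rows, number of columns)

print(row_labels)
print(column_labels)
print(shape)
```

Note that the row labels may repeat (school1 appears twice), which matters for the `.loc` lookups that follow.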


In [6]:
# Similar to the Series, we can extract data using the .iloc and .loc attributes. Because the
# DataFrame is two-dimensional, passing a single value to the .loc indexing operator returns
# a Series if there's only one row to return.

# For instance, if we wanted to select data associated with school2, we would just query the
# .loc attribute with one parameter.
df.loc['school2']

Name          Jack
Class    Chemistry
Score           82
Name: school2, dtype: object

In [7]:
# We can check the data type of the return value using Python's type function.
type(df.loc['school2'])

pandas.core.series.Series
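
For comparison, here is a small sketch of the position-based counterpart, `.iloc`, next to the label-based `.loc` lookup used above. It also shows that passing a list of labels to `.loc` keeps the result two-dimensional even when only one row matches. The frame is rebuilt so the block runs on its own, and the variable names are my own.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

by_label = df.loc['school2']    # label-based: the row labelled 'school2'
by_position = df.iloc[1]        # position-based: the second row (positions start at 0)
as_frame = df.loc[['school2']]  # a list of labels returns a DataFrame, not a Series

print(type(by_label))
print(type(by_position))
print(type(as_frame))
```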

In [24]:
# Remember that the indices and column names along either axis, horizontal or
# vertical, can be non-unique. In this example, we see two records for school1 as different rows.
# If we pass a single value to the DataFrame's .loc attribute and several rows match, they are
# returned not as a Series but as a new DataFrame.

# Let's query for school1 records
df.loc['school1']  # a DataFrame, because two rows carry the label school1

          Name    Class  Score
school1  Alice  Physics     85
school1  Helen  Biology     90


In [25]:
# And we can see that the type of this is different too
type(df.loc['school1'])

pandas.core.frame.DataFrame

In [27]:
# One of the powers of the pandas DataFrame is that you can quickly select data based on multiple axes.
# For instance, if you wanted to just list the student names for school1, you would supply two
# parameters to .loc, one being the row index and the other being the column name.

# Here we are only interested in school1's student names
df.loc['school1', 'Name'] # rows labelled school1, then the Name column

school1    Alice
school1    Helen
Name: Name, dtype: object

In [30]:
# What if we just wanted to select a single column? There are a few ways to do it.

# Firstly, we could transpose the matrix. This pivots all of the rows into columns
# and all of the columns into rows, and is done with the T attribute
print(df.T) # the transposed DataFrame

# Then we can call .loc on the transpose to get the student names only
df.T.loc['Name']

       school1    school2  school1
Name     Alice       Jack    Helen
Class  Physics  Chemistry  Biology
Score       85         82       90


school1    Alice
school2     Jack
school1    Helen
Name: Name, dtype: object

In [31]:
# However, since iloc and loc are used for row selection, pandas reserves the indexing operator
# directly on the DataFrame for column selection. In a pandas DataFrame, columns always have a name,
# so this selection is always label based and is not as confusing as it was when using the square
# bracket operator on Series objects. For those familiar with relational databases, this operator
# is analogous to column projection.
df['Name']

school1    Alice
school2     Jack
school1    Helen
Name: Name, dtype: object

In [37]:
# In practice, this works really well since you're often trying to add or drop columns. However,
# this also means that you get a KeyError if you pass a column name to .loc as if it were a row label

# df.loc['Name'] # raises KeyError, since 'Name' is not a row label
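
To make the failure concrete, the sketch below catches the `KeyError` and then shows one way to reach the same column through `.loc`, by giving a full row slice as the first parameter. The frame is rebuilt so the block runs on its own; `caught` and `names` are illustrative names.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

caught = None
try:
    df.loc['Name']          # 'Name' is a column label, not a row label
except KeyError as err:
    caught = err
    print('KeyError:', err)

# With a full row slice first, .loc looks up 'Name' on the column axis instead.
names = df.loc[:, 'Name']
print(names)
```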

In [39]:
# Note too that the result of a single column projection is a Series object
type(df['Name']) # a Series, since Name is a single column

pandas.core.series.Series

In [43]:
# Since the result of using the indexing operator is either a DataFrame or a Series, you can chain
# operations together. For instance, we can select all of the rows which relate to school1 using
# .loc, then project the Name column from just those rows.

df.loc['school1']['Name']
# By contrast, df.loc['school2']['Name'] would return the single value 'Jack',
# because only one row is labelled school2.

school1    Alice
school1    Helen
Name: Name, dtype: object

In [44]:
# If you get confused, use type to check the responses from resulting operations
print(type(df.loc['school1'])) #should be a DataFrame
print(type(df.loc['school1']['Name'])) #should be a Series

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


In [45]:
# Chaining, by indexing on the return value of another indexing operation, can come with some costs
# and is best avoided if you can use another approach. In particular, chaining tends to cause pandas
# to return a copy of the DataFrame instead of a view on the DataFrame.
# For selecting data, this is not a big deal, though it might be slower than necessary.
# If you are changing data, though, this is an important distinction and can be a source of error:
# an assignment made through a chain may land on the temporary copy and never reach the original.
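
One way to avoid chaining when changing data is to give `.loc` the row label and the column name in a single call. This is a minimal sketch on a copy of the example frame; the replacement score of 95 is made up for illustration, and the chained form is left as a comment rather than executed because its behaviour can differ between pandas versions.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

scores = df.copy()  # work on a copy so the original frame stays untouched

# Chained form (not run here): scores.loc['school2']['Score'] = 95
# That assigns into whatever object the first lookup returned.

# A single .loc call names both the row and the column, so the
# assignment is applied to `scores` itself.
scores.loc['school2', 'Score'] = 95  # illustrative value

print(scores.loc['school2', 'Score'])
print(df.loc['school2', 'Score'])
```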

In [49]:
# Here's another approach. As we saw, .loc does row selection, and it can take two parameters:
# the row index and a list of column names. The .loc attribute also supports slicing.

# If we want to select all rows, we can use a colon to indicate a full slice from beginning to end.
# This is just like slicing a list in Python. Then we can add the column name as the
# second parameter as a string. If we want to include multiple columns, we can pass them in a list,
# and pandas will bring back only the columns we have asked for.

# Here we ask for the names and scores for all schools using the .loc operator.
df.loc[:, ['Name', 'Score']]
# :                 -- all the rows, whatever their index labels
# ['Name', 'Score'] -- only the Name and Score columns for each row



          Name  Score
school1  Alice     85
school2   Jack     82
school1  Helen     90
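
Slicing with `.loc` also works along the column axis. As a short sketch, the label slice below selects every column from Name through Class; unlike positional slices in plain Python, a label slice includes its end label. The frame is rebuilt so the block runs on its own.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

# All rows, and the columns from 'Name' up to and including 'Class'.
subset = df.loc[:, 'Name':'Class']
print(subset)
```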


In [55]:
# ============ dropping data ==============
# It's easy to delete data in Series and DataFrames: we can use the drop() function.
# drop() takes one main parameter, the index or row label to drop.

# By default, drop() does not change the DataFrame!
# Instead, it returns a copy of the DataFrame with the given rows removed.

df.drop('school1') # returns a new DataFrame without the school1 rows

# df # the original DataFrame remains intact

         Name      Class  Score
school2  Jack  Chemistry     82
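
A quick way to convince yourself that `drop()` leaves the original alone is to keep the returned frame in a variable and compare the two. This is a small sketch with the example frame rebuilt; `dropped` is an illustrative name.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

dropped = df.drop('school1')  # removes every row labelled 'school1'

print(len(df))       # the original still has all of its rows
print(len(dropped))  # the returned copy has only the school2 row
```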


In [76]:
# drop() has two interesting optional parameters.

# The first is called inplace. If it's set to True, the DataFrame is updated in place
# instead of a copy being returned.

# The second is axis, which selects the axis to drop from. By default, this value is 0,
# indicating the row axis, but you can change it to 1 if you want to drop a column.

# For example, let's make a copy of the DataFrame using .copy()
copy_df = df.copy()

# Now let's drop the Name column in this copy
copy_df.drop("Name", inplace=True, axis=1)   # NB: axis=0 -- rows, axis=1 -- columns
copy_df # copy_df was changed in place; the original df still has its Name column

             Class  Score
school1    Physics     85
school2  Chemistry     82
school1    Biology     90
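
If the axis numbers are hard to remember, `drop()` also accepts a `columns` keyword that names the axis for you. The sketch below compares the two spellings on the rebuilt example frame, without `inplace`, so the original is left as it was.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

without_name = df.drop(columns=['Name'])  # keyword form
same_result = df.drop('Name', axis=1)     # axis form used in the cell above

print(without_name.equals(same_result))
print(list(df.columns))  # the original columns are unchanged
```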


In [77]:
# There is a second way to drop a column: use the del keyword with the indexing
# operator. This way of dropping data takes immediate effect
# on the DataFrame and returns nothing, neither a copy nor a view.

del copy_df['Score'] # del works on columns only, not on rows
copy_df # copy_df itself was modified

             Class
school1    Physics
school2  Chemistry
school1    Biology


In [78]:
# ======== adding a new column to the DataFrame ==================
# If we wanted to add a ClassRanking column with a default value of None,
# we could do so by using the assignment operator after the square brackets.

# This broadcasts the default value to every row of the new column immediately.

df['ClassRanking'] = None
df

          Name      Class  Score ClassRanking
school1  Alice    Physics     85         None
school2   Jack  Chemistry     82         None
school1  Helen    Biology     90         None
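
The same assignment syntax accepts one value per row instead of a single scalar. As a sketch, the ranking below is derived from the Score column with `Series.rank`; treating a higher score as a better rank is my assumption, not something the original data states.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

# A scalar is broadcast to every row.
df['ClassRanking'] = None

# An array with one entry per row fills the column row by row.
# Assumption: rank 1 goes to the highest score.
ranks = df['Score'].rank(ascending=False).astype(int).to_numpy()
df['ClassRanking'] = ranks

print(df)
```

Converting to a NumPy array before assigning sidesteps index alignment, which is worth doing here because the row labels are not unique.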
