### Data Frames Intro

What is a dataframe? It is a two-dimensional data structure made up of rows and columns, essentially a table, whereas a series is a one-dimensional data structure. The difference between the two is the point of reference: a value in a series is located by a single label, while a value in a dataframe is located by its row and its column. The intersection of the two identifies the value, and needing both is what makes the structure two-dimensional.
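A minimal sketch of the one-dimensional versus two-dimensional distinction. The ages and positions are taken from the first rows of the nba data used in this notebook; the short first-name index labels and the use of `.loc` are assumptions made only for this illustration.

```python
import pandas as pd

# A Series is one-dimensional: each value is located by a single label.
ages = pd.Series([25, 27, 22], index=["Avery", "John", "R.J."])

# A DataFrame is two-dimensional: each value is located by a row label
# and a column label together.
players = pd.DataFrame(
    {"Age": [25, 27, 22], "Position": ["PG", "SG", "SG"]},
    index=["Avery", "John", "R.J."],
)

print(ages.ndim)     # number of dimensions of the Series
print(players.ndim)  # number of dimensions of the DataFrame

# One label is enough for the Series; the DataFrame needs row and column.
print(ages.loc["John"])
print(players.loc["John", "Position"])
```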

- Selecting columns.

- Add new columns with assignment.

- Using insert( ) to create a new column with a constant value.


In [1]:
import pandas as pd

In [2]:
# import the csv file as a data frame
df_nba = pd.read_csv('../datasets/nba.csv') 

In [18]:
df_nba.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0


<b>Get a count of rows and columns.</b>

<b>Note:</b> The data set has missing values. The last row consists entirely of missing values; it can be viewed using: <b> df_nba.tail( )</b>

<b>Note 2:</b> When a numeric column has a missing value, pandas stores it as NaN, which is a float, so every value in that column appears as a float (for example, Number shows 0.0 rather than 0).

In [19]:
df_nba.shape

(458, 9)
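The float behavior described in Note 2 can be reproduced on a tiny example. This is only a sketch with made-up Series names (`complete`, `with_gap`); it relies on pandas' default dtype inference.

```python
import numpy as np
import pandas as pd

# Whole numbers with nothing missing are stored with an integer dtype.
complete = pd.Series([25, 27, 22])

# A single missing value (NaN) is a float, so the whole Series
# is stored as float64 and 25 is displayed as 25.0.
with_gap = pd.Series([25, np.nan, 22])

print(complete.dtype)
print(with_gap.dtype)
print(with_gap.isna().sum())  # how many values are missing
```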

Get a list of column names.

In [20]:
df_nba.columns

Index(['Name', 'Team', 'Number', 'Position', 'Age', 'Height', 'Weight',
       'College', 'Salary'],
      dtype='object')

Check the data types.

In [21]:
df_nba.dtypes

Name         object
Team         object
Number      float64
Position     object
Age         float64
Height       object
Weight      float64
College      object
Salary      float64
dtype: object

### Shared Methods and Attributes between Series and DataFrames
Access an attribute with dot syntax and invoke a method with ( )

In [None]:
# Methods in common (shown here on the data frame)
df_nba.head()
df_nba.tail()
df_nba.head(10) # n = the number of rows, passed inside the parentheses

In [22]:
# Attributes in common - .index
# .dtypes
# .shape
# note: .columns exists only on a data frame, not on a series
df_nba.index

RangeIndex(start=0, stop=458, step=1)

In [23]:
# Attributes in common - .values, gives the underlying numpy array
df_nba.values

array([['Avery Bradley', 'Boston Celtics', 0.0, ..., 180.0, 'Texas',
        7730337.0],
       ['Jae Crowder', 'Boston Celtics', 99.0, ..., 235.0, 'Marquette',
        6796117.0],
       ['John Holland', 'Boston Celtics', 30.0, ..., 205.0,
        'Boston University', nan],
       ...,
       ['Tibor Pleiss', 'Utah Jazz', 21.0, ..., 256.0, nan, 2900000.0],
       ['Jeff Withey', 'Utah Jazz', 24.0, ..., 231.0, 'Kansas', 947276.0],
       [nan, nan, nan, ..., nan, nan, nan]], dtype=object)
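To make the shared attributes concrete, here is a small sketch that checks them on both a data frame and a series pulled from it. The two-row frame is a cut-down stand-in for the nba data, and the variable names `df` and `s` are assumptions for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["Avery Bradley", "Jae Crowder"], "Age": [25.0, 25.0]})
s = df["Age"]  # a single column is a Series

# .shape exists on both, but has one entry per dimension.
print(df.shape)
print(s.shape)

# .index is shared: the Series keeps the row labels of the frame.
print(df.index.equals(s.index))

# .values returns a numpy array in both cases.
print(type(df.values), type(s.values))

# .columns only makes sense for the two-dimensional structure.
print(hasattr(df, "columns"), hasattr(s, "columns"))
```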

Use value_counts( ) to count how many columns have each data type.

In [26]:
# method chaining
df_nba.dtypes.value_counts()

object     5
float64    4
dtype: int64

In [27]:
# note: older pandas versions do not have info() on a series object (Series.info() was added in pandas 1.4)
# df.info() shows the non-null count per column; note the 373 non-null for College and 446 for Salary
df_nba.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 458 entries, 0 to 457
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      457 non-null    object 
 1   Team      457 non-null    object 
 2   Number    457 non-null    float64
 3   Position  457 non-null    object 
 4   Age       457 non-null    float64
 5   Height    457 non-null    object 
 6   Weight    457 non-null    float64
 7   College   373 non-null    object 
 8   Salary    446 non-null    float64
dtypes: float64(4), object(5)
memory usage: 32.3+ KB
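The non-null counts reported by info( ) can also be obtained as values to work with. The sketch below uses `count()` and `isna().sum()` on a three-row sample; neither call appears in the original notebook, and the sample rows are copied from the head of the nba data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Avery Bradley", "John Holland", "Jonas Jerebko"],
    "College": ["Texas", "Boston University", np.nan],
    "Salary": [7730337.0, np.nan, 5000000.0],
})

# count() returns the number of non-null values per column,
# the same figure info() prints in its "Non-Null Count" column.
non_null = df.count()

# isna().sum() gives the complementary number of missing values.
missing = df.isna().sum()

print(non_null)
print(missing)
```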


### Differences between Shared Methods

In [3]:
# load a new csv called revenue
rev = pd.read_csv('../datasets/revenue.csv')
rev.head(3)

Unnamed: 0,Date,New York,Los Angeles,Miami
0,1/1/16,985,122,499
1,1/2/16,738,788,534
2,1/3/16,14,20,933


In [4]:
# make the date column the index by importing with index_col= "Date"
rev = pd.read_csv('../datasets/revenue.csv', index_col = 'Date')
rev.head(3)

Date,New York,Los Angeles,Miami
1/1/16,985,122,499
1/2/16,738,788,534
1/3/16,14,20,933
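The effect of index_col can be tried without the file on disk. This sketch feeds the first three rows shown above to read_csv through `io.StringIO`; the in-memory text and the names `default_index` and `date_index` are assumptions for illustration only.

```python
import io
import pandas as pd

# The first three rows of the revenue data, held in memory as CSV text.
csv_text = (
    "Date,New York,Los Angeles,Miami\n"
    "1/1/16,985,122,499\n"
    "1/2/16,738,788,534\n"
    "1/3/16,14,20,933\n"
)

# Without index_col, Date is an ordinary column and the index is 0, 1, 2.
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col, Date becomes the row labels and is no longer a column.
date_index = pd.read_csv(io.StringIO(csv_text), index_col="Date")

print(default_index.columns.tolist())
print(date_index.columns.tolist())
print(date_index.loc["1/2/16", "Miami"])
```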


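On a series, a method such as sum( ) adds up all of the elements and returns a single value. Called the same way on a data frame, sum( ) adds up each column separately and returns a series with one total per column. To add across each row instead, the data frame needs the axis argument: axis = 1.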

Example of the difference:

In [30]:
# the series sums by adding all elements
s = pd.Series([1,2,3])

s.sum()

6

In [31]:
# the data frame sums each col and returns a series; if you want the rows summed, pass the axis argument.
rev.sum()

New York       5475
Los Angeles    5134
Miami          5641
dtype: int64

In [32]:
# pass the axis argument to sum horizontally; both forms can be used but behave differently.
rev.sum(axis=1)

Date
1/1/16     1606
1/2/16     2060
1/3/16      967
1/4/16     2519
1/5/16      438
1/6/16     1935
1/7/16     1234
1/8/16     2313
1/9/16     2623
1/10/16     555
dtype: int64
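The two directions of summing can be checked by hand on the first three rows of the revenue data. This is a sketch on a hand-built sample named `rev_sample` (an assumed name); `axis="columns"` is shown as the spelled-out equivalent of `axis=1`.

```python
import pandas as pd

rev_sample = pd.DataFrame(
    {
        "New York": [985, 738, 14],
        "Los Angeles": [122, 788, 20],
        "Miami": [499, 534, 933],
    },
    index=pd.Index(["1/1/16", "1/2/16", "1/3/16"], name="Date"),
)

# Default (axis=0): one total per column.
by_column = rev_sample.sum()

# axis=1: one total per row, adding across the columns.
by_row = rev_sample.sum(axis=1)

# axis="columns" is another way to write axis=1.
by_row_named = rev_sample.sum(axis="columns")

print(by_column)
print(by_row)
```

Both results add up to the same grand total, which is a quick sanity check that no value was dropped in either direction.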

### Selecting one column from a dataframe.

In [5]:
# use the nba data set again
nba = pd.read_csv('../datasets/nba.csv') 

In [34]:
# view the columns available for selection
nba.columns

Index(['Name', 'Team', 'Number', 'Position', 'Age', 'Height', 'Weight',
       'College', 'Salary'],
      dtype='object')

In [39]:
# using brackets method to return a series and using head to limit output.
# Note: with single brackets, you can only return 1 column. To select > 1, use [['col1', 'col2', 'col3' ]]
nba['Name'].head(3)

0    Avery Bradley
1      Jae Crowder
2     John Holland
Name: Name, dtype: object

### Select two or more columns from a dataframe.

In [41]:
# pandas needs a list of the column names, which is why there are double brackets: the inner pair is the list
# given there is more than 1 col, pandas cannot represent the result as a series, so it must be a data frame.
nba[["Name", "Team"]].head(3)

Unnamed: 0,Name,Team
0,Avery Bradley,Boston Celtics
1,Jae Crowder,Boston Celtics
2,John Holland,Boston Celtics
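The single-bracket versus double-bracket difference shows up directly in the type of the result. A small sketch on a three-row sample (the name `nba_sample` is an assumption):

```python
import pandas as pd

nba_sample = pd.DataFrame({
    "Name": ["Avery Bradley", "Jae Crowder", "John Holland"],
    "Team": ["Boston Celtics", "Boston Celtics", "Boston Celtics"],
})

# Single brackets with one name return a Series.
one_col_series = nba_sample["Name"]

# Double brackets pass a list, so the result is a DataFrame,
# even when the list holds only one name.
one_col_frame = nba_sample[["Name"]]

# The order of names in the list sets the column order of the result.
two_cols = nba_sample[["Team", "Name"]]

print(type(one_col_series))
print(type(one_col_frame))
print(two_cols.columns.tolist())
```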


In [43]:
# reorder the cols by ordering the name selection
nba[["Team", "Name", "Salary", "College"]].head(3)

Unnamed: 0,Team,Name,Salary,College
0,Boston Celtics,Avery Bradley,7730337.0,Texas
1,Boston Celtics,Jae Crowder,6796117.0,Marquette
2,Boston Celtics,John Holland,,Boston University


#### Create a list for columns to select and insert the variable in the brackets.

In [45]:
colNames = ["Salary", "Team", "Name"]

nba[colNames].head()

Unnamed: 0,Salary,Team,Name
0,7730337.0,Boston Celtics,Avery Bradley
1,6796117.0,Boston Celtics,Jae Crowder
2,,Boston Celtics,John Holland
3,1148640.0,Boston Celtics,R.J. Hunter
4,5000000.0,Boston Celtics,Jonas Jerebko


### Add new columns to a dataframe.
Standard/traditional way

In [46]:
# add a Sport column by assigning it like a variable; the single value is repeated for every row
nba["Sport"] = "Basketball"

nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary,Sport
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0,Basketball
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0,Basketball
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,,Basketball


In [49]:
# add a league column
nba["League"] = "National Basketball Association"
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary,Sport,League
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0,Basketball,National Basketball Association
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0,Basketball,National Basketball Association
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,,Basketball,National Basketball Association
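Two properties of column assignment are worth seeing on a small sample: the new column is placed last, and assigning to a name that already exists replaces that column rather than adding a second one. The sample frame and the replacement value "NBA Basketball" are made up for this sketch.

```python
import pandas as pd

nba_sample = pd.DataFrame({
    "Name": ["Avery Bradley", "Jae Crowder", "John Holland"],
    "Team": ["Boston Celtics", "Boston Celtics", "Boston Celtics"],
})

# A single value is repeated for every row; the column is added at the end.
nba_sample["Sport"] = "Basketball"
columns_after_add = nba_sample.columns.tolist()

# Assigning to an existing name overwrites the column in place.
nba_sample["Sport"] = "NBA Basketball"
columns_after_overwrite = nba_sample.columns.tolist()

print(columns_after_add)
print(nba_sample["Sport"].unique())
```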


#### Using insert( ) to add columns.
Import the nba data set again so that it no longer has the Sport and League columns.

Arguments are loc, column, value, allow_duplicates=False - <b>Use shift/tab after mouse is in () to see additional info</b>

In [6]:
nba = pd.read_csv('../datasets/nba.csv') 

In [51]:
# insert() permanently modifies the df in place, so there is no need for an inplace param.
nba.insert(2, column="Sport", value="Basketball")

In [52]:
# sport is now added as the 3rd column
nba.head(3)

Unnamed: 0,Name,Team,Sport,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,Basketball,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,Basketball,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,Basketball,30.0,SG,27.0,6-5,205.0,Boston University,


In [53]:
# add the league info as the 4th col
nba.insert(3, column="League", value="National Basketball Association")
# preview it
nba.head(3)

Unnamed: 0,Name,Team,Sport,League,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,Basketball,National Basketball Association,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,Basketball,National Basketball Association,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,Basketball,National Basketball Association,30.0,SG,27.0,6-5,205.0,Boston University,
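A short sketch of how insert( ) behaves on a small sample: it changes the frame in place, returns None, and, with allow_duplicates left at its default, refuses to add a column name that already exists. The sample frame and the variable names `result` and `duplicate_error` are assumptions for illustration.

```python
import pandas as pd

nba_sample = pd.DataFrame({
    "Name": ["Avery Bradley", "Jae Crowder"],
    "Team": ["Boston Celtics", "Boston Celtics"],
    "Number": [0.0, 99.0],
})

# loc=1 places the new column second; positions are counted from 0.
result = nba_sample.insert(1, column="Sport", value="Basketball")
print(result)                       # insert() returns None
print(nba_sample.columns.tolist())

# Inserting the same name again raises ValueError by default.
duplicate_error = None
try:
    nba_sample.insert(0, column="Sport", value="Basketball")
except ValueError as err:
    duplicate_error = str(err)
print(duplicate_error)
```

Because insert( ) returns None, writing `nba = nba.insert(...)` would replace the frame with None; the call is meant to be used on its own line as in the cells above.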
