A `DataFrame` is a two-dimensional labeled data structure, analogous to a table with rows and columns.

In [1]:
import pandas as pd

In [2]:
nba = pd.read_csv('pandas/nba.csv')

-----
## Shared attributes and methods between `Series` and `DataFrame`
### `.head(..)` and `.tail(..)` methods

In [3]:
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


In [4]:
nba.tail(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0
457,,,,,,,,,


### `.index` attribute

In [5]:
nba.index

RangeIndex(start=0, stop=458, step=1)

### `.values` attribute
Returns a two-dimensional NumPy array
- Every row is returned as an inner array, inside an outer array that represents the table

In [6]:
nba.values

array([['Avery Bradley', 'Boston Celtics', 0.0, ..., 180.0, 'Texas',
        7730337.0],
       ['Jae Crowder', 'Boston Celtics', 99.0, ..., 235.0, 'Marquette',
        6796117.0],
       ['John Holland', 'Boston Celtics', 30.0, ..., 205.0,
        'Boston University', nan],
       ...,
       ['Tibor Pleiss', 'Utah Jazz', 21.0, ..., 256.0, nan, 2900000.0],
       ['Jeff Withey', 'Utah Jazz', 24.0, ..., 231.0, 'Kansas', 947276.0],
       [nan, nan, nan, ..., nan, nan, nan]], dtype=object)

### `.shape` attribute

In [7]:
nba.shape

(458, 9)

### `.dtypes` attribute
Returns the data type of each column of the `DataFrame` <br/>
**NOTE:** A `Series` is returned, in which the column names of the `DataFrame` form the index and the data type of each column is the corresponding value.

In [8]:
nba.dtypes

Name         object
Team         object
Number      float64
Position     object
Age         float64
Height       object
Weight      float64
College      object
Salary      float64
dtype: object
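
To make the note above concrete, here is a minimal sketch on a small hand-made `DataFrame` (the data and the name `df` are illustrative assumptions, not rows read from `nba.csv`). It shows that `.dtypes` is an ordinary `Series` that can be indexed by column name.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df = pd.DataFrame({
    "Name": ["A", "B"],
    "Age": [25.0, 27.0],
    "Number": [0, 99],
})

dtypes = df.dtypes                 # a Series: index = column names, values = dtypes
print(type(dtypes).__name__)       # Series
print(list(dtypes.index))          # ['Name', 'Age', 'Number']
print(dtypes["Age"])               # float64
```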

---
## Attributes and methods available for `DataFrame`

### `.columns` and `.axes` attributes

In [9]:
nba.columns

Index(['Name', 'Team', 'Number', 'Position', 'Age', 'Height', 'Weight',
       'College', 'Salary'],
      dtype='object')

In [10]:
nba.axes

[RangeIndex(start=0, stop=458, step=1),
 Index(['Name', 'Team', 'Number', 'Position', 'Age', 'Height', 'Weight',
        'College', 'Salary'],
       dtype='object')]

### `.info(..)` method
Provides a summary of the `DataFrame`

In [11]:
nba.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 458 entries, 0 to 457
Data columns (total 9 columns):
Name        457 non-null object
Team        457 non-null object
Number      457 non-null float64
Position    457 non-null object
Age         457 non-null float64
Height      457 non-null object
Weight      457 non-null float64
College     373 non-null object
Salary      446 non-null float64
dtypes: float64(4), object(5)
memory usage: 32.3+ KB


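### `.get_dtype_counts(..)` method
- Returns the number of columns of each data type.
- For example, 5 columns are `int64` type, 3 columns are `float32` type
- Note that this information is already a part of the `.info(..)` output
- **NOTE:** This method was deprecated in pandas 0.25 and removed in pandas 1.0; on newer versions, `nba.dtypes.value_counts()` gives the same counts.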

In [12]:
nba.get_dtype_counts()

float64    4
object     5
dtype: int64
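
On pandas versions where `.get_dtype_counts()` is not available, the same counts can be obtained from `.dtypes`. The sketch below uses a hypothetical toy `DataFrame` (not `nba.csv`); the helper dictionary `counts_by_name` is introduced here only to make the result easy to read.

```python
import pandas as pd

# Hypothetical toy data with two float columns and one integer column
df = pd.DataFrame({
    "Age": [25.0, 27.0],
    "Weight": [180.0, 205.0],
    "Number": [0, 99],
})

# Count how many columns share each dtype
dtype_counts = df.dtypes.value_counts()
print(dtype_counts)

# Same information keyed by the dtype's name
counts_by_name = {str(dtype): int(n) for dtype, n in dtype_counts.items()}
print(counts_by_name["float64"])   # 2
```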

---
## Different behavior of the methods shared between `Series` and `DataFrame`

In [13]:
rev = pd.read_csv('pandas/revenue.csv', index_col='Date')
rev.head(3)

Date,New York,Los Angeles,Miami
1/1/16,985,122,499
1/2/16,738,788,534
1/3/16,14,20,933


### Calling `.sum(..)` on a `Series` versus on a `DataFrame`

In [14]:
s = pd.Series([1,2,3,4])
s.sum()

10

In [15]:
rev.sum() # Equivalent to rev.sum(axis=0) or rev.sum(axis='rows')

New York       5475
Los Angeles    5134
Miami          5641
dtype: int64

**NOTE:** A `Series` is returned in which the column names of the `DataFrame` form the index, and the sum is computed for every column that holds numerical values.

But what if we want the sum across each row instead of down each column? In that case, use the **`axis`** parameter.

In [16]:
rev.sum(axis=1) # Equivalent to rev.sum(axis = 'columns')

Date
1/1/16     1606
1/2/16     2060
1/3/16      967
1/4/16     2519
1/5/16      438
1/6/16     1935
1/7/16     1234
1/8/16     2313
1/9/16     2623
1/10/16     555
dtype: int64
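
The two directions of `.sum(..)` can be compared side by side on a small frame. This is a sketch that rebuilds only the first two rows of the revenue table shown above by hand; the name `rev_demo` is an assumption made for the example.

```python
import pandas as pd

# First two rows of the revenue table, typed in by hand for illustration
rev_demo = pd.DataFrame(
    {"New York": [985, 738], "Los Angeles": [122, 788], "Miami": [499, 534]},
    index=["1/1/16", "1/2/16"],
)

col_totals = rev_demo.sum()          # axis=0: one total per column
row_totals = rev_demo.sum(axis=1)    # axis=1: one total per row

print(col_totals["New York"])        # 1723
print(row_totals["1/1/16"])          # 1606
```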

### Selecting a column from the `DataFrame`
Writing the column name after the dot
- returns a `Series`
- is NOT the recommended way, because it does not work if the column name contains whitespace

In [17]:
nba.Name

0                Avery Bradley
1                  Jae Crowder
2                 John Holland
3                  R.J. Hunter
4                Jonas Jerebko
5                 Amir Johnson
6                Jordan Mickey
7                 Kelly Olynyk
8                 Terry Rozier
9                 Marcus Smart
10             Jared Sullinger
11               Isaiah Thomas
12                 Evan Turner
13                 James Young
14                Tyler Zeller
15            Bojan Bogdanovic
16                Markel Brown
17             Wayne Ellington
18     Rondae Hollis-Jefferson
19                Jarrett Jack
20              Sergey Karasev
21             Sean Kilpatrick
22                Shane Larkin
23                 Brook Lopez
24            Chris McCullough
25                 Willie Reed
26             Thomas Robinson
27                  Henry Sims
28                Donald Sloan
29              Thaddeus Young
                ...           
428            Al-Farouq Aminu
429     

#### Using the bracket syntax
Works even if the column names contain whitespace

In [18]:
nba['Name'].head(3)

0    Avery Bradley
1      Jae Crowder
2     John Holland
Name: Name, dtype: object
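
The difference between the two syntaxes shows up as soon as a column name contains a space. The sketch below uses a hypothetical column called `Jersey Number` (not a column of `nba.csv`) to illustrate this.

```python
import pandas as pd

# Hypothetical toy data; "Jersey Number" contains a space on purpose
df = pd.DataFrame({"Name": ["A", "B"], "Jersey Number": [0, 99]})

by_attr = df.Name                    # dot syntax works for simple names
by_bracket = df["Jersey Number"]     # bracket syntax works for any name

# df.Jersey Number would be a SyntaxError, so brackets are the only option here
print(by_bracket.tolist())           # [0, 99]
```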

### Selecting two or more columns from a `DataFrame`
Another `DataFrame` is returned.

In [19]:
nba[['Salary', 'Name']].head(3) # Note: output columns appear in the same order in which the column names are provided.

Unnamed: 0,Salary,Name
0,7730337.0,Avery Bradley
1,6796117.0,Jae Crowder
2,,John Holland


## Adding new columns to a `DataFrame`
### Method 1: Assigning a value to a new column name

In [20]:
nba['Sport'] = 'Basketball'
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary,Sport
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0,Basketball
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0,Basketball
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,,Basketball


### Method 2: Using the `.insert(..)` method
`.insert(..)` modifies the `DataFrame` in place, so there is no need to reassign the result or to pass an `inplace` parameter.

In [21]:
nba.insert(loc=3,column='sport2', value='Basketball2')
nba.head(3)

Unnamed: 0,Name,Team,Number,sport2,Position,Age,Height,Weight,College,Salary,Sport
0,Avery Bradley,Boston Celtics,0.0,Basketball2,PG,25.0,6-2,180.0,Texas,7730337.0,Basketball
1,Jae Crowder,Boston Celtics,99.0,Basketball2,SF,25.0,6-6,235.0,Marquette,6796117.0,Basketball
2,John Holland,Boston Celtics,30.0,Basketball2,SG,27.0,6-5,205.0,Boston University,,Basketball
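
The in-place behavior of `.insert(..)` can be verified on a toy frame. In this sketch (hypothetical data, with `df` and `result` as illustrative names) the call returns `None` and the new column appears at the requested position.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"Name": ["A", "B"], "Team": ["X", "Y"]})

# insert() changes df itself and returns None
result = df.insert(loc=1, column="Sport", value="Basketball")

print(result)               # None
print(list(df.columns))     # ['Name', 'Sport', 'Team']
```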


---
## Broadcasting operations
An arithmetic method called on a `Series` is applied to every value in it, without writing an explicit loop.

In [22]:
nba = pd.read_csv('pandas/nba.csv')
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


**Adding 5 to the age of all the players** <br/>
If a value is null (`NaN`), the result of the addition is also `NaN`; no error is raised.

In [23]:
nba['Age'].add(5).head(5)

0    30.0
1    30.0
2    32.0
3    27.0
4    34.0
Name: Age, dtype: float64

**Similarly, `.sub(..)`, `.mul(..)`, `.div(..)` etc. are also available.**
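
As a small sketch of how nulls behave under broadcasting, the example below uses a hand-made `Series` (hypothetical values, not taken from `nba.csv`). It also shows that `.add(5)` gives the same result as the `+` operator, and that the optional `fill_value` parameter substitutes a value for nulls before adding.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value
ages = pd.Series([25.0, np.nan, 27.0])

added = ages.add(5)                    # NaN stays NaN, no error is raised
same = ages + 5                        # operator form of the same operation
filled = ages.add(5, fill_value=0)     # missing value treated as 0 before adding

print(added.tolist())                  # [30.0, nan, 32.0]
print(filled.tolist())                 # [30.0, 5.0, 32.0]
```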

---
## Dropping rows with null values

In [24]:
nba = pd.read_csv('pandas/nba.csv')
nba.tail(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0
457,,,,,,,,,


### `.dropna(..)` method
Calling `.dropna(..)` on a `DataFrame` with no arguments removes every row that has a null in one or more columns.

**NOTE:** `.dropna(..)` returns a new `DataFrame` by default; pass `inplace=True` to modify the original instead.

In [26]:
nba.dropna().tail(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
452,Trey Lyles,Utah Jazz,41.0,PF,20.0,6-10,234.0,Kentucky,2239800.0
453,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0


#### `.dropna(how='all')`
Drops only those rows that have **ALL** the columns as null.

#### `.dropna(how='any', axis=1)`
To drop the entire COLUMN (instead of row) that has one or more null values.
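
The `how` and `axis` options described above can be compared on a toy frame. This is a minimal sketch with hypothetical data in which the last row is entirely null and the `College` column has one extra null.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 2 is all null, row 1 is missing only College
df = pd.DataFrame({
    "Name": ["A", "B", np.nan],
    "College": ["Texas", np.nan, np.nan],
    "Salary": [100.0, 200.0, np.nan],
})

any_rows = df.dropna()                  # drops rows 1 and 2 (any null)
all_rows = df.dropna(how="all")         # drops only row 2 (all null)
no_null_cols = all_rows.dropna(how="any", axis=1)   # drops the College column

print(any_rows.index.tolist())          # [0]
print(all_rows.index.tolist())          # [0, 1]
print(list(no_null_cols.columns))       # ['Name', 'Salary']
```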

#### `.dropna(.., subset = )` parameter
**Drop a row if the value in a particular column of that row is null.** <br/>
The statement below drops any row that has a null in the `Salary` column, regardless of whether any other column in that row is null.

In [28]:
nba.dropna(subset=['Salary']).head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0


---
## Filling in the null values with `.fillna(..)` method
It is **NOT** recommended to execute the `.fillna(..)` method on the entire `DataFrame`.

For example, if we execute `nba.fillna(0)`, all the `NaN` values from the `Salary` column will be replaced by 0.0 (which is intended). However, it will also replace the `NaN` values in the `College` column by **0**, which is undesirable as it does not make any sense to have a **0** in the `College` column. 

In [29]:
nba = pd.read_csv('pandas/nba.csv')
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


In [30]:
# Assign the result back; newer pandas versions warn against inplace=True on a selected column
nba['Salary'] = nba['Salary'].fillna(value=0)
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,0.0


**Similarly looking at the `College` column.**

In [32]:
nba['College'].head(6)

0                Texas
1            Marquette
2    Boston University
3        Georgia State
4                  NaN
5                  NaN
Name: College, dtype: object

In [33]:
# Assign the result back; newer pandas versions warn against inplace=True on a selected column
nba['College'] = nba['College'].fillna(value='Not available')
nba['College'].head(6)

0                Texas
1            Marquette
2    Boston University
3        Georgia State
4        Not available
5        Not available
Name: College, dtype: object
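
When several columns each need their own replacement value, `.fillna(..)` also accepts a dictionary that maps column names to fill values. The sketch below shows this on a hypothetical two-row frame; it is one possible alternative to the column-by-column approach above, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one null in each column
df = pd.DataFrame({
    "College": ["Texas", np.nan],
    "Salary": [7730337.0, np.nan],
})

# Each column gets its own fill value; df itself is left unchanged
filled = df.fillna({"Salary": 0, "College": "Not available"})

print(filled["Salary"].tolist())    # [7730337.0, 0.0]
print(filled["College"].tolist())   # ['Texas', 'Not available']
```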

## `.astype(..)` method to convert the data type of a `Series`
**NOTE:** Before converting to an integer type, the `Series` must **NOT** contain any null values, because `NaN` cannot be represented as a NumPy integer.

In [37]:
nba = pd.read_csv('pandas/nba.csv').dropna(how='all')
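# Assign the results back; newer pandas versions warn against inplace=True on a selected column
nba['Salary'] = nba['Salary'].fillna(0)
nba['College'] = nba['College'].fillna('None')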

nba.head(6)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,0.0
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0
5,Amir Johnson,Boston Celtics,90.0,PF,29.0,6-9,240.0,,12000000.0


Checking the current data types of all columns

In [39]:
nba.dtypes

Name         object
Team         object
Number      float64
Position     object
Age         float64
Height       object
Weight      float64
College      object
Salary      float64
dtype: object

We can also use the `.info(..)` method, which, in addition to giving the data types of the columns, also reports how many non-null values each column has. <br/> Also note the current memory usage. Since we'll be converting some columns from float to int, we expect a reduction in the overall memory usage of the DataFrame.

In [41]:
nba.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 457 entries, 0 to 456
Data columns (total 9 columns):
Name        457 non-null object
Team        457 non-null object
Number      457 non-null float64
Position    457 non-null object
Age         457 non-null float64
Height      457 non-null object
Weight      457 non-null float64
College     457 non-null object
Salary      457 non-null float64
dtypes: float64(4), object(5)
memory usage: 35.7+ KB


In [42]:
nba['Salary'] = nba['Salary'].astype('int')
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,0


In [43]:
nba[['Number', 'Age']] = nba[['Number', 'Age']].astype('int')
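
The requirement that nulls be handled before an integer conversion can be checked on a toy `Series`. In this sketch (hypothetical values, with `int64` chosen explicitly so the result does not depend on the platform) the first conversion fails and the second succeeds once the null has been filled.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries with one missing value
salary = pd.Series([7730337.0, np.nan])

try:
    salary.astype("int64")           # NaN cannot be stored as an integer
    failed = False
except (ValueError, TypeError):
    failed = True

# Fill the null first, then convert
converted = salary.fillna(0).astype("int64")

print(failed)                        # True
print(converted.tolist())            # [7730337, 0]
```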

**TIP:** Whenever a column has only a small number of unique values, consider converting it to the `category` dtype, as this can reduce memory usage significantly.

For example, columns such as Gender or Month in a DataFrame are good candidates for conversion to categorical.

In [45]:
nba['Position'] = nba['Position'].astype('category')
nba.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 457 entries, 0 to 456
Data columns (total 9 columns):
Name        457 non-null object
Team        457 non-null object
Number      457 non-null int32
Position    457 non-null category
Age         457 non-null int32
Height      457 non-null object
Weight      457 non-null float64
College     457 non-null object
Salary      457 non-null int32
dtypes: category(1), float64(1), int32(3), object(4)
memory usage: 27.4+ KB


In [46]:
nba['Team'] = nba['Team'].astype('category')
nba.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 457 entries, 0 to 456
Data columns (total 9 columns):
Name        457 non-null object
Team        457 non-null category
Number      457 non-null int32
Position    457 non-null category
Age         457 non-null int32
Height      457 non-null object
Weight      457 non-null float64
College     457 non-null object
Salary      457 non-null int32
dtypes: category(2), float64(1), int32(3), object(3)
memory usage: 25.8+ KB
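
The memory effect of the `category` dtype can also be measured directly on a single column with `.memory_usage(deep=True)`. The sketch below uses a hypothetical `Series` of 1,000 values drawn from only two team names; exact byte counts depend on the pandas version, so none are shown.

```python
import pandas as pd

# Hypothetical column: 1,000 values but only 2 unique team names
teams = pd.Series(["Boston Celtics", "Utah Jazz"] * 500)
as_category = teams.astype("category")

# deep=True includes the memory used by the strings themselves
plain_bytes = teams.memory_usage(deep=True)
cat_bytes = as_category.memory_usage(deep=True)

print(len(as_category.cat.categories))   # 2
print(cat_bytes < plain_bytes)           # True
```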


---
## Sorting a DataFrame
### `.sort_values(..)` method to sort the `DataFrame` by a column

In [47]:
nba = pd.read_csv('pandas/nba.csv')
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


#### Sort the values in the `DataFrame` by the `Name` column

In [53]:
nba.sort_values(by='Name').head(4)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
152,Aaron Brooks,Chicago Bulls,0.0,PG,31.0,6-0,161.0,Oregon,2250000.0
356,Aaron Gordon,Orlando Magic,0.0,PF,20.0,6-9,220.0,Arizona,4171680.0
328,Aaron Harrison,Charlotte Hornets,9.0,SG,21.0,6-6,210.0,Kentucky,525093.0
404,Adreian Payne,Minnesota Timberwolves,33.0,PF,25.0,6-10,237.0,Michigan State,1938840.0


#### Sort the values in the `DataFrame` by the `Age` column in descending order

In [54]:
nba.sort_values(by='Age', ascending=False).head(4)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
304,Andre Miller,San Antonio Spurs,24.0,PG,40.0,6-3,200.0,Utah,250750.0
400,Kevin Garnett,Minnesota Timberwolves,21.0,PF,40.0,6-11,240.0,,8500000.0
298,Tim Duncan,San Antonio Spurs,21.0,C,40.0,6-11,250.0,Wake Forest,5250000.0
261,Vince Carter,Memphis Grizzlies,15.0,SG,39.0,6-6,220.0,North Carolina,4088019.0


**NOTE: If there are any NaN values in the column that we are sorting the DataFrame by, they are placed at the end by default.**

In [56]:
nba.sort_values('Salary').tail(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
397,Axel Toupane,Denver Nuggets,6.0,SG,23.0,6-7,210.0,,
409,Greg Smith,Minnesota Timberwolves,4.0,PF,25.0,6-10,250.0,Fresno State,
457,,,,,,,,,


**However, if we want the null values to be at the beginning of the column being sorted, use the `na_position = 'first'` parameter**

In [58]:
nba.sort_values('Salary', na_position='first').tail(2)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
169,LeBron James,Cleveland Cavaliers,23.0,SF,31.0,6-8,250.0,,22970500.0
109,Kobe Bryant,Los Angeles Lakers,24.0,SF,37.0,6-6,212.0,,25000000.0


In [59]:
nba.sort_values('Salary', na_position='first').head(2)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
46,Elton Brand,Philadelphia 76ers,42.0,PF,37.0,6-9,254.0,Duke,


---
### `.sort_values(..)` method to sort the `DataFrame` by multiple columns

In [61]:
nba = pd.read_csv('pandas/nba.csv')
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


**First sort by `Team` and then by `Name`**

In [63]:
nba.sort_values(['Team', 'Name']).head(6)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
312,Al Horford,Atlanta Hawks,15.0,C,30.0,6-10,245.0,Florida,12000000.0
318,Dennis Schroder,Atlanta Hawks,17.0,PG,22.0,6-1,172.0,,1763400.0
323,Jeff Teague,Atlanta Hawks,0.0,PG,27.0,6-2,186.0,Wake Forest,8000000.0
309,Kent Bazemore,Atlanta Hawks,24.0,SF,26.0,6-5,201.0,Old Dominion,2000000.0
311,Kirk Hinrich,Atlanta Hawks,12.0,SG,35.0,6-4,190.0,Kansas,2854940.0
313,Kris Humphries,Atlanta Hawks,43.0,PF,31.0,6-9,235.0,Minnesota,1000000.0


**NOTE:** Both columns are sorted in ascending order by default.

However, if we want `Team` to be sorted in ascending order and `Name` in descending order, pass a list of booleans to `ascending`, one per column:

In [64]:
nba.sort_values(['Team', 'Name'], ascending=[True, False]).head(6)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
322,Walter Tavares,Atlanta Hawks,22.0,C,24.0,7-3,260.0,,1000000.0
310,Tim Hardaway Jr.,Atlanta Hawks,10.0,SG,24.0,6-6,205.0,Michigan,1304520.0
321,Tiago Splitter,Atlanta Hawks,11.0,C,31.0,6-11,245.0,,9756250.0
320,Thabo Sefolosha,Atlanta Hawks,25.0,SF,32.0,6-7,220.0,,4000000.0
315,Paul Millsap,Atlanta Hawks,4.0,PF,31.0,6-8,246.0,Louisiana Tech,18671659.0
319,Mike Scott,Atlanta Hawks,32.0,PF,27.0,6-8,237.0,Virginia,3333333.0
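
The per-column sort order can be confirmed on a four-row frame. This sketch reuses four player and team names that appear in the outputs above, typed in by hand; the name `df` is an assumption made for the example.

```python
import pandas as pd

# Four players from the outputs above, entered by hand
df = pd.DataFrame({
    "Team": ["Utah Jazz", "Atlanta Hawks", "Atlanta Hawks", "Utah Jazz"],
    "Name": ["Jeff Withey", "Al Horford", "Walter Tavares", "Tibor Pleiss"],
})

# Team ascending, then Name descending within each team
mixed = df.sort_values(["Team", "Name"], ascending=[True, False])

print(mixed["Name"].tolist())
# ['Walter Tavares', 'Al Horford', 'Tibor Pleiss', 'Jeff Withey']
```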
