## DataFrames 1 - Intro to DataFrames Module

## Table of Contents

<ul>
    <li><a href="#1">1. Shared Methods and Attributes between Series and DataFrames</a></li>
    <li><a href="#2">2. Differences between shared Methods for Series and DataFrames</a></li>
    <li><a href="#3">3. Selecting One Column from a DataFrame</a></li>
    <li><a href="#4">4. Selecting Two or more Columns from a DataFrame</a></li>
    <li><a href="#5">5. Add a New Column to a DataFrame</a></li>
    <li><a href="#6">6. Broadcasting Operations</a></li>
    <li><a href="#7">7. A Review of value_counts() Method</a></li>
    <li><a href="#8">8. Drop rows with Null Values</a></li>
    <li><a href="#9">9. Fill Null Values with .fillna() Method</a></li>
    <li><a href="#10">10. The .astype() Method</a></li>
    <li><a href="#11">11. Sort a DataFrame with the .sort_values() Method, Part 1</a></li>
    <li><a href="#12">12. Sort a DataFrame with the .sort_values() Method, Part 2</a></li>
    <li><a href="#13">13. Sort a DataFrame with .sort_index() Method</a></li>
    <li><a href="#14">14. Rank Values with .rank() Method</a></li>
</ul>

<a id='1'></a>
### 1. Shared Methods and Attributes between Series and DataFrames

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all" # Show all results without print

import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
pd.__version__

'1.5.2'

In [2]:
pd.read_csv(filepath_or_buffer='nba.csv').head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0


In [3]:
nba = pd.read_csv(filepath_or_buffer='nba.csv')

In [4]:
nba.head(n=3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


In [5]:
nba.tail(2)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0
457,,,,,,,,,


In [6]:
nba.index # attributes are accessed without parentheses

RangeIndex(start=0, stop=458, step=1)

In [7]:
nba.values

array([['Avery Bradley', 'Boston Celtics', 0.0, ..., 180.0, 'Texas',
        7730337.0],
       ['Jae Crowder', 'Boston Celtics', 99.0, ..., 235.0, 'Marquette',
        6796117.0],
       ['John Holland', 'Boston Celtics', 30.0, ..., 205.0,
        'Boston University', nan],
       ...,
       ['Tibor Pleiss', 'Utah Jazz', 21.0, ..., 256.0, nan, 2900000.0],
       ['Jeff Withey', 'Utah Jazz', 24.0, ..., 231.0, 'Kansas', 947276.0],
       [nan, nan, nan, ..., nan, nan, nan]], dtype=object)

In [8]:
nba.shape

(458, 9)

In [9]:
nba.dtypes

Name         object
Team         object
Number      float64
Position     object
Age         float64
Height       object
Weight      float64
College      object
Salary      float64
dtype: object

In [10]:
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


In [11]:
nba.columns # exclusive to dataframe

Index(['Name', 'Team', 'Number', 'Position', 'Age', 'Height', 'Weight',
       'College', 'Salary'],
      dtype='object')

In [12]:
nba.axes # list of [row index, column index]; a Series also has .axes, holding only its row index

[RangeIndex(start=0, stop=458, step=1),
 Index(['Name', 'Team', 'Number', 'Position', 'Age', 'Height', 'Weight',
        'College', 'Salary'],
       dtype='object')]

In [13]:
nba.info() # prints a summary of the DataFrame: index, columns, non-null counts, dtypes and memory usage

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 458 entries, 0 to 457
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      457 non-null    object 
 1   Team      457 non-null    object 
 2   Number    457 non-null    float64
 3   Position  457 non-null    object 
 4   Age       457 non-null    float64
 5   Height    457 non-null    object 
 6   Weight    457 non-null    float64
 7   College   373 non-null    object 
 8   Salary    446 non-null    float64
dtypes: float64(4), object(5)
memory usage: 32.3+ KB


In [14]:
# nba.get_dtype_counts() was removed in pandas 1.0; nba.dtypes.value_counts() returns the same information: a Series whose index is the dtype and whose values are the number of columns of that dtype
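
The attributes above can be compared side by side on a small frame. This is a minimal sketch that assumes a hypothetical three-row stand-in for `nba.csv` (the names `df`, `ages` and `dtype_counts` are illustrative, not part of the notebook).

```python
import pandas as pd

# Hypothetical three-row stand-in for the nba data
df = pd.DataFrame({
    "Name": ["Avery Bradley", "Jae Crowder", "John Holland"],
    "Age": [25.0, 25.0, 27.0],
    "Salary": [7730337.0, 6796117.0, None],
})
ages = df["Age"]  # a single column is a Series

# shape and index exist on both objects
print(df.shape, ages.shape)
print(df.index.equals(ages.index))

# columns exists on the DataFrame only
print(hasattr(df, "columns"), hasattr(ages, "columns"))

# Counting columns per dtype without get_dtype_counts()
dtype_counts = df.dtypes.value_counts()
print(dtype_counts)
```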

<a id='2'></a>
### 2. Differences between shared Methods for Series and DataFrames

In [15]:
rev = pd.read_csv(filepath_or_buffer='revenue.csv',
                 index_col = 'Date')
rev.head(3)

Date,New York,Los Angeles,Miami
1/1/16,985,122,499
1/2/16,738,788,534
1/3/16,14,20,933


In [16]:
series1 = pd.Series([1, 2, 3])
series1.sum()

6

In [17]:
rev.sum() # returns a Series: index labels are the DataFrame's column names, values are the column totals
#rev.sum(axis = "index") # equivalent: sums down the rows, giving one total per column
#rev.sum(axis = 0) # equivalent to axis = "index"

New York       5475
Los Angeles    5134
Miami          5641
dtype: int64

In [18]:
rev.sum(axis = 1) # index labels are the original row index; each value is the total across all columns
#rev.sum(axis = "columns") # equivalent to axis = 1

Date
1/1/16     1606
1/2/16     2060
1/3/16      967
1/4/16     2519
1/5/16      438
1/6/16     1935
1/7/16     1234
1/8/16     2313
1/9/16     2623
1/10/16     555
dtype: int64
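
The two axis settings can be checked on a frame small enough to verify by hand. This sketch rebuilds the first three rows of the revenue table shown above as a hypothetical `rev_demo` frame rather than reading `revenue.csv`.

```python
import pandas as pd

# First three rows of the revenue table, typed in by hand
rev_demo = pd.DataFrame(
    {
        "New York": [985, 738, 14],
        "Los Angeles": [122, 788, 20],
        "Miami": [499, 534, 933],
    },
    index=pd.Index(["1/1/16", "1/2/16", "1/3/16"], name="Date"),
)

col_totals = rev_demo.sum(axis="index")    # one total per city
row_totals = rev_demo.sum(axis="columns")  # one total per date

print(col_totals)
print(row_totals)
```

Both results add up to the same grand total, which is a quick sanity check when it is unclear which axis was summed.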

<a id='3'></a>
### 3. Selecting One Column from a DataFrame

In [19]:
nba = pd.read_csv(filepath_or_buffer='nba.csv')
nba.head(n=3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


In [20]:
# Approach 1 to select one column

nba.Name.head() # returns the Name column as a series
nba.Number.head() # returns the Number column as series

Output = None # to not show the output

0    Avery Bradley
1      Jae Crowder
2     John Holland
3      R.J. Hunter
4    Jonas Jerebko
Name: Name, dtype: object

0     0.0
1    99.0
2    30.0
3    28.0
4     8.0
Name: Number, dtype: float64

In [21]:
# Approach 2 to select one column

nba["Name"].head() # returns the Name column as a series
nba["Number"].head() # returns the Number column as a series

Output = None # to not show the output

0    Avery Bradley
1      Jae Crowder
2     John Holland
3      R.J. Hunter
4    Jonas Jerebko
Name: Name, dtype: object

0     0.0
1    99.0
2    30.0
3    28.0
4     8.0
Name: Number, dtype: float64

In [22]:
type(nba["Name"])

pandas.core.series.Series

In [23]:
nba["Name"].head(n=4)

0    Avery Bradley
1      Jae Crowder
2     John Holland
3      R.J. Hunter
Name: Name, dtype: object
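
The two approaches are not always interchangeable. The sketch below, built on a hypothetical frame with illustrative column names, shows two cases where only the bracket syntax reaches the column: a name containing spaces and a name that collides with an existing DataFrame method.

```python
import pandas as pd

# Hypothetical frame; "count" deliberately collides with DataFrame.count()
df = pd.DataFrame({
    "Name": ["Avery Bradley", "Jae Crowder"],
    "Weight in Kilograms": [81.64656, 106.59412],
    "count": [1, 2],
})

# For a simple name, both approaches return the same Series
same = df.Name.equals(df["Name"])

# A name with spaces is only reachable with brackets
kilos = df["Weight in Kilograms"]

# Attribute access finds the method first, not the column
attr_is_method = callable(df.count)
count_col = df["count"]

print(same, attr_is_method, type(count_col).__name__)
```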

<a id='4'></a>
### 4. Selecting Two or more Columns from a DataFrame

In [24]:
nba = pd.read_csv(filepath_or_buffer='nba.csv')
nba.head(n=3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


In [25]:
nba[["Name", "Team", "Position"]].head(n=3) # returns multiple columns as a dataframe

Unnamed: 0,Name,Team,Position
0,Avery Bradley,Boston Celtics,PG
1,Jae Crowder,Boston Celtics,SF
2,John Holland,Boston Celtics,SG


In [26]:
type(nba[["Name", "Team", "Position"]])

pandas.core.frame.DataFrame

In [27]:
nba[["Team", "Name"]].head(n=3) # can pull columns in any order

Unnamed: 0,Team,Name
0,Boston Celtics,Avery Bradley
1,Boston Celtics,Jae Crowder
2,Boston Celtics,John Holland


In [28]:
nba[["Number", "College"]].head(n=3)

Unnamed: 0,Number,College
0,0.0,Texas
1,99.0,Marquette
2,30.0,Boston University


In [29]:
nba[["Salary", "Team", "Name"]].tail(n=3)

Unnamed: 0,Salary,Team,Name
455,2900000.0,Utah Jazz,Tibor Pleiss
456,947276.0,Utah Jazz,Jeff Withey
457,,,


In [30]:
select_cols = ["Salary", "Team", "Name"]
nba[select_cols].tail(n=3)

Unnamed: 0,Salary,Team,Name
455,2900000.0,Utah Jazz,Tibor Pleiss
456,947276.0,Utah Jazz,Jeff Withey
457,,,
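
The number of brackets decides the return type. A minimal sketch on a hypothetical two-row frame: a single label gives a Series, a list of labels gives a DataFrame even when the list has one entry, and a label that does not exist raises `KeyError`.

```python
import pandas as pd

# Hypothetical two-row stand-in for the nba data
df = pd.DataFrame({
    "Name": ["Avery Bradley", "Jae Crowder"],
    "Team": ["Boston Celtics", "Boston Celtics"],
    "Salary": [7730337.0, 6796117.0],
})

as_series = df["Name"]    # single label -> Series
as_frame = df[["Name"]]   # list with one label -> one-column DataFrame

try:
    df[["Name", "Nickname"]]  # "Nickname" is not a column
    missing_raises = False
except KeyError:
    missing_raises = True

print(type(as_series).__name__, type(as_frame).__name__, missing_raises)
```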


<a id='5'></a>
### 5. Add a New Column to a DataFrame

nba = pd.read_csv(filepath_or_buffer='nba.csv')
nba.head(n=3)

In [31]:
nba["Sport"] = "Basketball" # the scalar is broadcast, so every row of Sport equals "Basketball". A new column assigned this way is appended at the end of the DataFrame

In [32]:
nba.head(n=3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary,Sport
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0,Basketball
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0,Basketball
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,,Basketball


In [33]:
nba["League"] = "National Basketball Association"

In [34]:
nba.head(n=3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary,Sport,League
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0,Basketball,National Basketball Association
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0,Basketball,National Basketball Association
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,,Basketball,National Basketball Association


In [35]:
nba = pd.read_csv(filepath_or_buffer='nba.csv')
nba.head(n=3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


#### Adding column through a method
- `insert()` lets a column be added at any position in the DataFrame through the `loc` parameter

In [36]:
nba.insert(loc = 3, column = "Sport", value = "Basketball") # insert() has no inplace parameter; it always modifies nba in place and returns None
nba.head(n=3)

Unnamed: 0,Name,Team,Number,Sport,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,Basketball,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,Basketball,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,Basketball,SG,27.0,6-5,205.0,Boston University,


In [37]:
nba.insert(loc = 7, column = "League", value = "National Basketball Association")
nba.head(n=3)

Unnamed: 0,Name,Team,Number,Sport,Position,Age,Height,League,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,Basketball,PG,25.0,6-2,National Basketball Association,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,Basketball,SF,25.0,6-6,National Basketball Association,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,Basketball,SG,27.0,6-5,National Basketball Association,205.0,Boston University,
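
Two properties of `insert()` are easy to miss: it returns `None`, and by default it refuses to add a column name that already exists. The sketch below shows both on a hypothetical two-column frame.

```python
import pandas as pd

# Hypothetical two-row stand-in for the nba data
df = pd.DataFrame({
    "Name": ["Avery Bradley", "Jae Crowder"],
    "Team": ["Boston Celtics", "Boston Celtics"],
})

# insert() changes df itself and returns None
result = df.insert(loc=1, column="Sport", value="Basketball")
columns_after = list(df.columns)

# Inserting the same column name again raises ValueError
try:
    df.insert(loc=0, column="Sport", value="Basketball")
    duplicate_raises = False
except ValueError:
    duplicate_raises = True

print(result, columns_after, duplicate_raises)
```

Because the return value is `None`, writing `df = df.insert(...)` would discard the DataFrame.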


<a id='6'></a>
### 6. Broadcasting Operations

In [38]:
nba = pd.read_csv(filepath_or_buffer='nba.csv')
nba.head(n=3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


In [39]:
nba["Age"].head(3)
print("-"*50)
nba["Age"].add(5).head(3) # adds 5 to every value in the Age column; NaN values stay NaN
print("-"*50)
(nba["Age"] + 5).head(3) # same result with the + operator

0    25.0
1    25.0
2    27.0
Name: Age, dtype: float64

--------------------------------------------------


0    30.0
1    30.0
2    32.0
Name: Age, dtype: float64

--------------------------------------------------


0    30.0
1    30.0
2    32.0
Name: Age, dtype: float64

In [40]:
nba["Salary"].head(3)
print("-"*50)
nba["Salary"].sub(5000000).head(3) # subtracts 5000000 from every value in the Salary column; NaN values stay NaN
print("-"*50)
(nba["Salary"] - 5000000).head(3) # same result with the - operator

0    7730337.0
1    6796117.0
2          NaN
Name: Salary, dtype: float64

--------------------------------------------------


0    2730337.0
1    1796117.0
2          NaN
Name: Salary, dtype: float64

--------------------------------------------------


0    2730337.0
1    1796117.0
2          NaN
Name: Salary, dtype: float64

In [41]:
nba["Weight"].head(3)
print("-"*50)
nba["Weight"].mul(0.453592).head(3) # multiplies every Weight value by 0.453592 (pounds to kilograms); NaN values stay NaN
print("-"*50)
(nba["Weight"] * 0.453592).head(3) # same result with the * operator

0    180.0
1    235.0
2    205.0
Name: Weight, dtype: float64

--------------------------------------------------


0     81.64656
1    106.59412
2     92.98636
Name: Weight, dtype: float64

--------------------------------------------------


0     81.64656
1    106.59412
2     92.98636
Name: Weight, dtype: float64

In [42]:
nba["Weight in Kilograms1"] = nba["Weight"].mul(0.453592) 
nba.head(n=3)

nba.insert(loc = 3, column = "Weight in Kilograms2", value = nba["Weight"].mul(0.453592)) # insert() modifies nba in place
nba.head(n=3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary,Weight in Kilograms1
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0,81.64656
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0,106.59412
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,,92.98636


Unnamed: 0,Name,Team,Number,Weight in Kilograms2,Position,Age,Height,Weight,College,Salary,Weight in Kilograms1
0,Avery Bradley,Boston Celtics,0.0,81.64656,PG,25.0,6-2,180.0,Texas,7730337.0,81.64656
1,Jae Crowder,Boston Celtics,99.0,106.59412,SF,25.0,6-6,235.0,Marquette,6796117.0,106.59412
2,John Holland,Boston Celtics,30.0,92.98636,SG,27.0,6-5,205.0,Boston University,,92.98636


In [43]:
nba["Salary"].head(3)
print("-"*50)
nba["Salary"].div(1000000).head(3) # divides every Salary value by 1,000,000; NaN values stay NaN
print("-"*50)
(nba["Salary"] / 1000000).head(3) # same result with the / operator

0    7730337.0
1    6796117.0
2          NaN
Name: Salary, dtype: float64

--------------------------------------------------


0    7.730337
1    6.796117
2         NaN
Name: Salary, dtype: float64

--------------------------------------------------


0    7.730337
1    6.796117
2         NaN
Name: Salary, dtype: float64

In [44]:
nba["Salary in Millions1"] = nba["Salary"].div(1000000) 
nba.head(n=3)

nba.insert(loc = 3, column = "Salary in Millions2", value = nba["Salary"].div(1000000)) # insert() modifies nba in place
nba.head(n=3)

Unnamed: 0,Name,Team,Number,Weight in Kilograms2,Position,Age,Height,Weight,College,Salary,Weight in Kilograms1,Salary in Millions1
0,Avery Bradley,Boston Celtics,0.0,81.64656,PG,25.0,6-2,180.0,Texas,7730337.0,81.64656,7.730337
1,Jae Crowder,Boston Celtics,99.0,106.59412,SF,25.0,6-6,235.0,Marquette,6796117.0,106.59412,6.796117
2,John Holland,Boston Celtics,30.0,92.98636,SG,27.0,6-5,205.0,Boston University,,92.98636,


Unnamed: 0,Name,Team,Number,Salary in Millions2,Weight in Kilograms2,Position,Age,Height,Weight,College,Salary,Weight in Kilograms1,Salary in Millions1
0,Avery Bradley,Boston Celtics,0.0,7.730337,81.64656,PG,25.0,6-2,180.0,Texas,7730337.0,81.64656,7.730337
1,Jae Crowder,Boston Celtics,99.0,6.796117,106.59412,SF,25.0,6-6,235.0,Marquette,6796117.0,106.59412,6.796117
2,John Holland,Boston Celtics,30.0,,92.98636,SG,27.0,6-5,205.0,Boston University,,92.98636,
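
The method form and the operator form of a broadcast operation give the same result, and a missing value stays missing in both. A minimal sketch, assuming a hypothetical three-value Series copied from the first Salary rows above:

```python
import numpy as np
import pandas as pd

# First three Salary values from the table above, typed in by hand
salary = pd.Series([7730337.0, 6796117.0, np.nan], name="Salary")

by_method = salary.div(1_000_000)   # method form
by_operator = salary / 1_000_000    # operator form

# equals() treats NaN in the same position as equal
identical = by_method.equals(by_operator)
still_missing = by_method.isna().tolist()

print(by_method)
print(identical, still_missing)
```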


<a id='7'></a>
### 7. A Review of value_counts() Method

In [45]:
nba = pd.read_csv(filepath_or_buffer='nba.csv')
nba.head(n=3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


In [46]:
nba["Team"].value_counts().head(n=3)
nba["Position"].value_counts().head(n=3)
nba["Weight"].value_counts().head(n=3)
nba["Salary"].value_counts().head(n=3)

New Orleans Pelicans    19
Memphis Grizzlies       18
New York Knicks         16
Name: Team, dtype: int64

SG    102
PF    100
PG     92
Name: Position, dtype: int64

220.0    29
240.0    28
250.0    26
Name: Weight, dtype: int64

947276.0    31
845059.0    18
525093.0    13
Name: Salary, dtype: int64
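
`value_counts()` skips missing values and returns raw counts by default. The sketch below, on a hypothetical six-value Series, shows the `normalize` and `dropna` parameters that change those two defaults.

```python
import pandas as pd

# Hypothetical positions with one missing value
positions = pd.Series(["SG", "PF", "SG", "PG", None, "SG"], name="Position")

counts = positions.value_counts()                 # raw counts, missing value skipped
shares = positions.value_counts(normalize=True)   # fraction of the non-missing values
with_nan = positions.value_counts(dropna=False)   # missing value counted as its own entry

print(counts)
print(shares)
print(with_nan)
```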

<a id='8'></a>
### 8. Drop rows with Null Values

In [47]:
nba = pd.read_csv(filepath_or_buffer='nba.csv')
nba.head(n=3)
nba.tail(n=3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0
457,,,,,,,,,


In [48]:
nba.dropna().head() # removes every row that contains at least one null value; returns a new DataFrame (inplace is False by default)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
6,Jordan Mickey,Boston Celtics,55.0,PF,21.0,6-8,235.0,LSU,1170960.0
7,Kelly Olynyk,Boston Celtics,41.0,C,25.0,7-0,238.0,Gonzaga,2165160.0


In [49]:
# how = 'any' (the default) removes rows that contain at least one null value
# how = 'all' removes only rows in which every value is null
nba.dropna(how = "all", inplace = True)
nba.tail(n=3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
454,Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,,900000.0
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0


In [50]:
# to remove any columns with null values
nba.dropna(how = "any", axis = 1).head(3) # axis = 1 or axis = "columns" drops columns instead of rows; inplace is False by default
nba.dropna(how = "any", axis = "columns", inplace = True)
nba.head(n=3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0


Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0


In [51]:
nba = pd.read_csv(filepath_or_buffer='nba.csv')
nba.head(n=3)
nba.tail(n=3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0
457,,,,,,,,,


In [52]:
nba.dropna(subset=["Salary"]).head(n=3) # drops a row only if its Salary value is null

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0


In [53]:
nba.dropna(subset=["Salary", "College"]).head(n=3) # drops a row if either its Salary or its College value is null

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
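
The `how` and `subset` options are easiest to compare on a frame where each row has a different pattern of missing values. This is a sketch on a hypothetical four-row frame, not the full `nba.csv`.

```python
import pandas as pd

# Row 0: complete, row 1: Salary missing, row 2: College missing, row 3: all missing
df = pd.DataFrame({
    "Name": ["Avery Bradley", "John Holland", "Jonas Jerebko", None],
    "College": ["Texas", "Boston University", None, None],
    "Salary": [7730337.0, None, 5000000.0, None],
})

any_null = df.dropna()                       # how="any": keeps only the complete row
all_null = df.dropna(how="all")              # drops only the fully empty row
salary_only = df.dropna(subset=["Salary"])   # looks at the Salary column only

print(len(any_null), len(all_null), len(salary_only))
print(salary_only.index.tolist())
```

The original index labels are kept, so gaps in the index show which rows were removed.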


<a id='9'></a>
### 9. Fill Null Values with .fillna() Method

In [54]:
nba = pd.read_csv(filepath_or_buffer='nba.csv')
nba.head(n=3)
nba.tail(n=3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0
457,,,,,,,,,


In [55]:
nba.fillna(value = 0).head(n=5) # replaces every NaN with 0 regardless of column; returns a new DataFrame (inplace is False by default)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,0.0
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,0,5000000.0


In [56]:
nba["Salary"] = nba["Salary"].fillna(value = 0) # fills NaN in the Salary column with 0; reassigning avoids an inplace call on a selected column, which newer pandas versions warn about
nba.head(n=5)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,0.0
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0


In [57]:
nba["College"] = nba["College"].fillna(value = "No College") # fills NaN in the College column with a placeholder string
nba.head(n=5)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,0.0
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,No College,5000000.0
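
Several columns can be filled in one call by passing a dictionary that maps column names to fill values. A minimal sketch on a hypothetical two-row frame; the fill values mirror the ones used above.

```python
import pandas as pd

# Hypothetical frame with one missing value per column
df = pd.DataFrame({
    "College": ["Texas", None],
    "Salary": [None, 5000000.0],
})

# Each column gets its own replacement; df itself is left unchanged
filled = df.fillna(value={"College": "No College", "Salary": 0})

print(filled)
print(df["Salary"].isna().sum())
```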


<a id='10'></a>
### 10. The .astype() Method 
- Converts the data type of a Series from one type to another
- Conversion to an integer type fails if the Series contains NaN values, so fill or drop them first
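
The NaN restriction can be seen directly. The sketch below uses a hypothetical three-value Series: the plain integer cast raises, filling first works, and the nullable `"Int64"` dtype (capital I) is an alternative that keeps the missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical Salary values with one missing entry
salary = pd.Series([7730337.0, np.nan, 5000000.0], name="Salary")

# Casting NaN to a plain integer dtype raises a ValueError subclass
try:
    salary.astype("int")
    raised = False
except ValueError:
    raised = True

# Option 1: fill the missing value, then cast
as_int = salary.fillna(0).astype("int")

# Option 2: nullable integer dtype keeps the value missing
nullable = salary.astype("Int64")

print(raised)
print(as_int.tolist())
print(nullable)
```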

In [58]:
nba = pd.read_csv(filepath_or_buffer='nba.csv')
nba.head(n=3)
nba.tail(n=3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0
457,,,,,,,,,


In [59]:
# Drop rows with all null values
nba = nba.dropna(how = "all")
nba.head(n=3)
nba.tail(n=3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
454,Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,,900000.0
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0


In [60]:
# Fill the remaining NaN values so the columns can be converted
nba["Salary"] = nba["Salary"].fillna(value = 0)
nba["College"] = nba["College"].fillna(value = "None")

In [61]:
nba.dtypes

Name         object
Team         object
Number      float64
Position     object
Age         float64
Height       object
Weight      float64
College      object
Salary      float64
dtype: object

In [62]:
nba.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 457 entries, 0 to 456
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      457 non-null    object 
 1   Team      457 non-null    object 
 2   Number    457 non-null    float64
 3   Position  457 non-null    object 
 4   Age       457 non-null    float64
 5   Height    457 non-null    object 
 6   Weight    457 non-null    float64
 7   College   457 non-null    object 
 8   Salary    457 non-null    float64
dtypes: float64(4), object(5)
memory usage: 35.7+ KB


In [63]:
nba["Salary"] = nba["Salary"].astype(dtype = "int") # astype() has no inplace parameter, so the result must be reassigned
nba.head(n=3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,0


In [64]:
nba.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 457 entries, 0 to 456
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      457 non-null    object 
 1   Team      457 non-null    object 
 2   Number    457 non-null    float64
 3   Position  457 non-null    object 
 4   Age       457 non-null    float64
 5   Height    457 non-null    object 
 6   Weight    457 non-null    float64
 7   College   457 non-null    object 
 8   Salary    457 non-null    int32  
dtypes: float64(3), int32(1), object(5)
memory usage: 33.9+ KB


In [65]:
nba["Age"] = nba["Age"].astype(dtype = "int")
nba["Number"] = nba["Number"].astype(dtype = "int")
nba.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 457 entries, 0 to 456
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      457 non-null    object 
 1   Team      457 non-null    object 
 2   Number    457 non-null    int32  
 3   Position  457 non-null    object 
 4   Age       457 non-null    int32  
 5   Height    457 non-null    object 
 6   Weight    457 non-null    float64
 7   College   457 non-null    object 
 8   Salary    457 non-null    int32  
dtypes: float64(1), int32(3), object(5)
memory usage: 30.3+ KB


In [66]:
# Category data is useful if a column has a small number of unique values (eg: Gender, Month, etc)
# Helps in reducing memory usage

nba.info()
print("Total Unique Positions: {}".format(nba["Position"].nunique()))
print("Total Unique Teams: {}".format(nba["Team"].nunique()))
nba["Position"] = nba["Position"].astype(dtype = "category")  
nba["Team"] = nba["Team"].astype(dtype = "category")
nba.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 457 entries, 0 to 456
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      457 non-null    object 
 1   Team      457 non-null    object 
 2   Number    457 non-null    int32  
 3   Position  457 non-null    object 
 4   Age       457 non-null    int32  
 5   Height    457 non-null    object 
 6   Weight    457 non-null    float64
 7   College   457 non-null    object 
 8   Salary    457 non-null    int32  
dtypes: float64(1), int32(3), object(5)
memory usage: 30.3+ KB
Total Unique Positions: 5
Total Unique Teams: 30
<class 'pandas.core.frame.DataFrame'>
Int64Index: 457 entries, 0 to 456
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   Name      457 non-null    object  
 1   Team      457 non-null    category
 2   Number    457 non-null    int32   
 3   Position  457 non-null    category
 4   Age   

<a id='11'></a>
### 11. Sort a `DataFrame` with the .sort_values() Method, Part 1
- Sorting by Single column

In [67]:
nba = pd.read_csv(filepath_or_buffer='nba.csv')
nba.head(n=3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


In [68]:
nba.sort_values(by = "Name", ascending = False).head(n=3) # will sort by name descending, inplace is False by default

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
237,Zaza Pachulia,Dallas Mavericks,27.0,C,32.0,6-11,275.0,,5200000.0
271,Zach Randolph,Memphis Grizzlies,50.0,PF,34.0,6-9,260.0,Michigan State,9638555.0
402,Zach LaVine,Minnesota Timberwolves,8.0,PG,21.0,6-5,189.0,UCLA,2148360.0


In [69]:
nba.sort_values(by = "Age", ascending = False).head(n=3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
304,Andre Miller,San Antonio Spurs,24.0,PG,40.0,6-3,200.0,Utah,250750.0
400,Kevin Garnett,Minnesota Timberwolves,21.0,PF,40.0,6-11,240.0,,8500000.0
298,Tim Duncan,San Antonio Spurs,21.0,C,40.0,6-11,250.0,Wake Forest,5250000.0


In [70]:
nba.sort_values(by = "Salary", ascending = False, inplace = True)
nba.head(n = 5)
nba.tail(n = 5) # NaN values are placed at the end by default when sorting; change this with na_position

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
109,Kobe Bryant,Los Angeles Lakers,24.0,SF,37.0,6-6,212.0,,25000000.0
169,LeBron James,Cleveland Cavaliers,23.0,SF,31.0,6-8,250.0,,22970500.0
33,Carmelo Anthony,New York Knicks,7.0,SF,32.0,6-8,240.0,Syracuse,22875000.0
251,Dwight Howard,Houston Rockets,12.0,C,30.0,6-11,265.0,,22359364.0
339,Chris Bosh,Miami Heat,1.0,PF,32.0,6-11,235.0,Georgia Tech,22192730.0


Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
350,Briante Weber,Miami Heat,12.0,PG,23.0,6-2,165.0,Virginia Commonwealth,
353,Dorell Wright,Miami Heat,11.0,SF,30.0,6-9,205.0,,
397,Axel Toupane,Denver Nuggets,6.0,SG,23.0,6-7,210.0,,
409,Greg Smith,Minnesota Timberwolves,4.0,PF,25.0,6-10,250.0,Fresno State,
457,,,,,,,,,


In [71]:
nba.sort_values(by = "Salary", ascending = False, inplace = True, na_position = "first")
nba.head(n = 5)
nba.tail(n = 5) # with na_position = "first", NaN values are placed at the top instead of the end

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
46,Elton Brand,Philadelphia 76ers,42.0,PF,37.0,6-9,254.0,Duke,
171,Dahntay Jones,Cleveland Cavaliers,30.0,SG,35.0,6-6,225.0,Duke,
264,Jordan Farmar,Memphis Grizzlies,4.0,PG,29.0,6-2,180.0,UCLA,
269,Ray McCallum,Memphis Grizzlies,5.0,PG,24.0,6-3,190.0,Detroit,


Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
175,Jordan McRae,Cleveland Cavaliers,12.0,SG,25.0,6-5,179.0,Tennessee,111196.0
135,Alan Williams,Phoenix Suns,15.0,C,23.0,6-8,260.0,UC Santa Barbara,83397.0
291,Orlando Johnson,New Orleans Pelicans,0.0,SG,27.0,6-5,220.0,UC Santa Barbara,55722.0
130,Phil Pressey,Phoenix Suns,25.0,PG,25.0,5-11,175.0,Missouri,55722.0
32,Thanasis Antetokounmpo,New York Knicks,43.0,SF,23.0,6-7,205.0,,30888.0
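
The effect of `na_position` is independent of the sort direction. A minimal sketch on a hypothetical three-row frame, using names and salaries from the table at the start of the notebook:

```python
import numpy as np
import pandas as pd

# Three rows typed in by hand; John Holland has no salary
df = pd.DataFrame({
    "Name": ["R.J. Hunter", "John Holland", "Avery Bradley"],
    "Salary": [1148640.0, np.nan, 7730337.0],
})

nan_last = df.sort_values(by="Salary", ascending=False)
nan_first = df.sort_values(by="Salary", ascending=False, na_position="first")

print(nan_last["Name"].tolist())
print(nan_first["Name"].tolist())
```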


<a id='12'></a>
### 12. Sort a `DataFrame` with the .sort_values() Method, Part 2
- Sorting by Multiple columns

In [72]:
nba = pd.read_csv(filepath_or_buffer='nba.csv')
nba.head(n=3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


In [73]:
nba.sort_values(by = ["Team", "Name"], ascending = False).head(n = 5) # A single bool applies to every column: Team and Name are both sorted in descending order

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
379,Ramon Sessions,Washington Wizards,7.0,PG,30.0,6-3,190.0,Nevada,2170465.0
378,Otto Porter Jr.,Washington Wizards,22.0,SF,23.0,6-8,198.0,Georgetown,4662960.0
375,Nene Hilario,Washington Wizards,42.0,C,33.0,6-11,250.0,,13000000.0
376,Markieff Morris,Washington Wizards,5.0,PF,26.0,6-10,245.0,Kansas,8000000.0
381,Marcus Thornton,Washington Wizards,15.0,SF,29.0,6-4,205.0,LSU,200600.0


In [74]:
nba.sort_values(by = ["Team", "Name"], ascending = [True, False]).head(n = 5) # Team is sorted ascending; within each team, Name is sorted descending

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
322,Walter Tavares,Atlanta Hawks,22.0,C,24.0,7-3,260.0,,1000000.0
310,Tim Hardaway Jr.,Atlanta Hawks,10.0,SG,24.0,6-6,205.0,Michigan,1304520.0
321,Tiago Splitter,Atlanta Hawks,11.0,C,31.0,6-11,245.0,,9756250.0
320,Thabo Sefolosha,Atlanta Hawks,25.0,SF,32.0,6-7,220.0,,4000000.0
315,Paul Millsap,Atlanta Hawks,4.0,PF,31.0,6-8,246.0,Louisiana Tech,18671659.0


In [75]:
nba.sort_values(by = ["Team", "Name"], ascending = [True, False], inplace = True) # inplace is False by default; with inplace = True, nba itself is modified and the call returns None
nba.head(n=5)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
322,Walter Tavares,Atlanta Hawks,22.0,C,24.0,7-3,260.0,,1000000.0
310,Tim Hardaway Jr.,Atlanta Hawks,10.0,SG,24.0,6-6,205.0,Michigan,1304520.0
321,Tiago Splitter,Atlanta Hawks,11.0,C,31.0,6-11,245.0,,9756250.0
320,Thabo Sefolosha,Atlanta Hawks,25.0,SF,32.0,6-7,220.0,,4000000.0
315,Paul Millsap,Atlanta Hawks,4.0,PF,31.0,6-8,246.0,Louisiana Tech,18671659.0
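As a minimal sketch of how the per-column `ascending` list behaves, the example below sorts a four-row roster. The team and player names are shortened placeholders chosen for illustration, not the full values from `nba.csv`.

```python
import pandas as pd

# Hypothetical four-row roster; values are placeholders for illustration.
roster = pd.DataFrame(
    {
        "Team": ["Hawks", "Celtics", "Hawks", "Celtics"],
        "Name": ["Paul", "Avery", "Tim", "Jae"],
    }
)

# The first key (Team) is sorted A to Z; ties within a team are broken
# by the second key (Name), sorted Z to A.
mixed = roster.sort_values(by=["Team", "Name"], ascending=[True, False])

print(mixed["Team"].tolist())  # ['Celtics', 'Celtics', 'Hawks', 'Hawks']
print(mixed["Name"].tolist())  # ['Jae', 'Avery', 'Tim', 'Paul']
print(mixed.index.tolist())    # [3, 1, 2, 0]
```

The original index labels travel with their rows, which is why the index is no longer in order after the sort.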


<a id='13'></a>
### 13. Sort a `DataFrame` with .sort_index() Method

In [76]:
nba = pd.read_csv(filepath_or_buffer='nba.csv')
nba.head(n=3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


In [77]:
nba.sort_values(by = ["Number", "Salary", "Name"], ascending = True, inplace = True)
nba.head(n=3)
nba.tail(n=3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
291,Orlando Johnson,New Orleans Pelicans,0.0,SG,27.0,6-5,220.0,UC Santa Barbara,55722.0
248,Andrew Goudelock,Houston Rockets,0.0,PG,27.0,6-3,200.0,Charleston,200600.0
347,Josh Richardson,Miami Heat,0.0,SG,22.0,6-6,200.0,Tennessee,525093.0


Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
68,Lucas Nogueira,Toronto Raptors,92.0,C,23.0,7-0,220.0,,1842000.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
457,,,,,,,,,


In [78]:
nba.sort_index(ascending = False, inplace = True) # Sorts rows by index label, here from highest to lowest; inplace is False by default
nba.head(n = 5)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
457,,,,,,,,,
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0
454,Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,,900000.0
453,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0
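One common use of `.sort_index()` is to undo an earlier `.sort_values()` call, since the index labels stay attached to their rows. The sketch below assumes a default integer index and uses three made-up jersey numbers.

```python
import pandas as pd

# Hypothetical frame with the default integer index 0, 1, 2.
df = pd.DataFrame({"Number": [30, 0, 99]})

# Sorting by value reorders the rows; each row keeps its original label.
shuffled = df.sort_values(by="Number")
print(shuffled.index.tolist())  # [1, 0, 2]

# Sorting by index (ascending by default) restores the original order.
restored = shuffled.sort_index()
print(restored.index.tolist())  # [0, 1, 2]
print(restored.equals(df))      # True
```

This only recovers the original order when the index reflected that order to begin with, as it does for a frame freshly loaded with `pd.read_csv`.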


<a id='14'></a>
### 14. Rank Values with `.rank()` Method
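- Make sure to fill or drop NaN values first: `.rank()` leaves NaN entries unranked, and a column containing NaN cannot be converted to integers with `.astype()`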

In [79]:
nba = pd.read_csv(filepath_or_buffer='nba.csv')
nba["Salary"] = nba["Salary"].fillna(value = 0)
nba["Salary"] = nba["Salary"].astype(dtype = "int")
nba.head(n=3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,0
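The reason for filling the missing salaries before ranking can be sketched on a three-value Series. The values are illustrative, and the example assumes the default `na_option` of `.rank()`, which keeps `NaN` in the result.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries with one missing value.
salary = pd.Series([7_730_337.0, np.nan, 6_796_117.0])

# By default rank() leaves NaN entries unranked (the result is NaN there).
ranks = salary.rank(ascending=False)
print(ranks.isna().tolist())  # [False, True, False]

# Converting a float Series that contains NaN to int raises an error
# (a ValueError subclass in recent pandas versions).
try:
    ranks.astype("int")
    cast_failed = False
except ValueError:
    cast_failed = True
print(cast_failed)  # True

# Filling NaN first gives every row a rank that can be cast to int.
filled_ranks = salary.fillna(0).rank(ascending=False).astype("int")
print(filled_ranks.tolist())  # [1, 3, 2]
```

Filling with 0 treats a missing salary as the lowest salary, which is a modelling choice; dropping those rows with `.dropna()` is an alternative.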


In [80]:
# By default rank() ranks from smallest to largest and returns floats; tied values share the average of their ranks
nba["Salary_Rank"] = nba["Salary"].rank(ascending = False).astype(dtype = "int")
nba.head(n = 3)
nba.sort_values(by = "Salary", ascending = False).head(n = 3)
nba.sort_values(by = "Salary_Rank", ascending = True).head(n = 3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary,Salary_Rank
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337,97
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117,110
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,0,452


Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary,Salary_Rank
109,Kobe Bryant,Los Angeles Lakers,24.0,SF,37.0,6-6,212.0,,25000000,1
169,LeBron James,Cleveland Cavaliers,23.0,SF,31.0,6-8,250.0,,22970500,2
33,Carmelo Anthony,New York Knicks,7.0,SF,32.0,6-8,240.0,Syracuse,22875000,3


Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary,Salary_Rank
109,Kobe Bryant,Los Angeles Lakers,24.0,SF,37.0,6-6,212.0,,25000000,1
169,LeBron James,Cleveland Cavaliers,23.0,SF,31.0,6-8,250.0,,22970500,2
33,Carmelo Anthony,New York Knicks,7.0,SF,32.0,6-8,240.0,Syracuse,22875000,3
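Tied salaries explain why `.rank()` returns floats: with the default `method = "average"`, tied rows share the mean of the positions they occupy, and casting to `int` then truncates any `.5`. The sketch below uses a made-up Series with two ties to compare the default with `method = "min"` and `method = "dense"`, which always produce whole numbers.

```python
import pandas as pd

# Hypothetical salaries containing a two-way tie and a three-way tie.
salaries = pd.Series([25_000_000, 55_722, 55_722, 0, 0, 0], name="Salary")

# Default method="average": tied rows get the mean of their positions.
average_rank = salaries.rank(ascending=False)
print(average_rank.tolist())  # [1.0, 2.5, 2.5, 5.0, 5.0, 5.0]

# Casting to int truncates 2.5 down to 2.
print(average_rank.astype("int").tolist())  # [1, 2, 2, 5, 5, 5]

# method="min": tied rows all get the lowest position in their group.
min_rank = salaries.rank(ascending=False, method="min").astype("int")
print(min_rank.tolist())  # [1, 2, 2, 4, 4, 4]

# method="dense": like "min", but ranks increase by 1 between groups.
dense_rank = salaries.rank(ascending=False, method="dense").astype("int")
print(dense_rank.tolist())  # [1, 2, 2, 3, 3, 3]
```

Which method fits best depends on how ties should be reported; `"min"` matches the usual competition-style ranking.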
