# DataFrames I

In [1]:
import pandas as pd

## Methods and Attributes between Series and DataFrames
- A **DataFrame** is a 2-dimensional table consisting of rows and columns.
- Pandas uses a `NaN` designation for cells that have a missing value. It is short for "not a number". Most operations on `NaN` values will produce `NaN` values.
- Like with a **Series**, Pandas assigns an index position/label to each **DataFrame** row.
- The **DataFrame** and **Series** have common and exclusive methods/attributes.
- The `hasnans` attribute exists only on a **Series**. The `columns` attribute exists only on a **DataFrame**.
- Some methods/attributes will return different types of data.
- The `info` method returns a summary of the pandas object.
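
Because `hasnans` lives only on a **Series**, checking a whole **DataFrame** for missing values needs a different route. The following is a minimal sketch on a small made-up frame (the `players` data is illustrative and not taken from `nba.csv`); it combines `isna` with `any`, which is one common option rather than the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature dataset standing in for nba.csv
players = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Salary": [100.0, np.nan, 300.0],
})

# Series-level check: hasnans is a Series attribute
salary_has_nan = players["Salary"].hasnans

# DataFrame-level check: isna() builds a Boolean frame, any() reduces it per column
nan_by_column = players.isna().any()

# DataFrame objects do not define hasnans
frame_has_attr = hasattr(players, "hasnans")
```

Here `nan_by_column` is itself a **Series**, indexed by column name, so it can be filtered or summed like any other.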

In [3]:
nba = pd.read_csv("nba.csv")
nba

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0
...,...,...,...,...,...,...,...
587,Ryan Rollins,Washington Wizards,G,6-3,180.0,Toledo,1719864.0
588,Landry Shamet,Washington Wizards,G,6-4,190.0,Wichita State,10250000.0
589,Tristan Vukcevic,Washington Wizards,F,6-10,220.0,Real Madrid,
590,Delon Wright,Washington Wizards,G,6-5,185.0,Utah,8195122.0


In [6]:
nba.head(8)

nba.tail(3)

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
589,Tristan Vukcevic,Washington Wizards,F,6-10,220.0,Real Madrid,
590,Delon Wright,Washington Wizards,G,6-5,185.0,Utah,8195122.0
591,,,,,,,


In [None]:
nba.index # row labels

RangeIndex(start=0, stop=592, step=1)

In [None]:
nba.values # numpy array representation of the DataFrame

array([['Saddiq Bey', 'Atlanta Hawks', 'F', ..., 215.0, 'Villanova',
        4556983.0],
       ['Bogdan Bogdanovic', 'Atlanta Hawks', 'G', ..., 225.0,
        'Fenerbahce', 18700000.0],
       ['Kobe Bufkin', 'Atlanta Hawks', 'G', ..., 195.0, 'Michigan',
        4094244.0],
       ...,
       ['Tristan Vukcevic', 'Washington Wizards', 'F', ..., 220.0,
        'Real Madrid', nan],
       ['Delon Wright', 'Washington Wizards', 'G', ..., 185.0, 'Utah',
        8195122.0],
       [nan, nan, nan, ..., nan, nan, nan]], dtype=object)

In [None]:
nba.shape # (rows, columns)

(592, 7)

In [11]:
nba.dtypes # data types of each column

Name         object
Team         object
Position     object
Height       object
Weight      float64
College      object
Salary      float64
dtype: object

In [12]:
nba.hasnans # Series-only attribute; raises AttributeError on a DataFrame

AttributeError: 'DataFrame' object has no attribute 'hasnans'

In [13]:
nba.columns # column labels

Index(['Name', 'Team', 'Position', 'Height', 'Weight', 'College', 'Salary'], dtype='object')

In [15]:
nba.axes # row and column axis labels

[RangeIndex(start=0, stop=592, step=1),
 Index(['Name', 'Team', 'Position', 'Height', 'Weight', 'College', 'Salary'], dtype='object')]

In [16]:
nba.info() # concise summary of the DataFrame

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 592 entries, 0 to 591
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      591 non-null    object 
 1   Team      591 non-null    object 
 2   Position  584 non-null    object 
 3   Height    585 non-null    object 
 4   Weight    584 non-null    float64
 5   College   578 non-null    object 
 6   Salary    488 non-null    float64
dtypes: float64(2), object(5)
memory usage: 32.5+ KB


## Differences between Shared Methods
- The `sum` method adds a **Series's** values.
- On a **DataFrame**, the `sum` method defaults to adding the values down the index (row by row), which produces one total per column.
- The `axis` parameter customizes the direction that we add across. Pass `"columns"` or `1` to add "across" the columns.
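
As a compact illustration of the two directions, the sketch below rebuilds a two-row, two-city slice of the revenue figures by hand (an assumption made so the block runs without `revenue.csv`) and sums it along each axis.

```python
import pandas as pd

# Hand-built slice of the revenue data: two dates, two cities
revenue = pd.DataFrame(
    {"New York": [985, 738], "Miami": [499, 534]},
    index=["1/1/26", "1/2/26"],
)

per_city = revenue.sum()               # default axis="index": one total per column
per_day = revenue.sum(axis="columns")  # one total per row
grand_total = revenue.sum().sum()      # collapse both axes into a single number

# The string "columns" and the integer 1 name the same axis
same_result = per_day.equals(revenue.sum(axis=1))
```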

In [19]:
revenue = pd.read_csv("revenue.csv", index_col="Date")
revenue

Date,New York,Los Angeles,Miami
1/1/26,985,122,499
1/2/26,738,788,534
1/3/26,14,20,933
1/4/26,730,904,885
1/5/26,114,71,253
1/6/26,936,502,497
1/7/26,123,996,115
1/8/26,935,492,886
1/9/26,846,954,823
1/10/26,54,285,216


In [27]:
revenue.sum(axis="index") # sum of each column

New York       5475
Los Angeles    5134
Miami          5641
dtype: int64

In [28]:
revenue.sum(axis="columns") # sum of each row

Date
1/1/26     1606
1/2/26     2060
1/3/26      967
1/4/26     2519
1/5/26      438
1/6/26     1935
1/7/26     1234
1/8/26     2313
1/9/26     2623
1/10/26     555
dtype: int64

In [29]:
revenue.sum(axis="columns").sum() # sum of all values in the DataFrame

np.int64(16250)

## Select One Column from a DataFrame
- We can use attribute syntax (`df.column_name`) to select a column from a **DataFrame**. The syntax will not work if the column name has spaces.
- We can also use square bracket syntax (`df["column name"]`) which will work for any column name.
- Pandas extracts a column from a **DataFrame** as a **Series**.
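- Whether the **Series** behaves as a view depends on the pandas version. Without Copy-on-Write (the default before pandas 3.0), changes to the **Series** *can* affect the **DataFrame**; with Copy-on-Write, they do not.
- Older versions may display a `SettingWithCopyWarning` when you mutate the extracted **Series**. Use the `copy` method to create an independent duplicate.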
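
The points above can be sketched on a tiny made-up frame (the `df` data is hypothetical). The block only relies on behavior that an explicit `copy` guarantees, so it should not depend on whether Copy-on-Write is enabled.

```python
import pandas as pd

# Hypothetical miniature frame; not the nba.csv data
df = pd.DataFrame({"Name": ["Ann", "Bob"], "Team": ["X", "Y"]})

# Attribute syntax and bracket syntax select the same Series
same = df.Name.equals(df["Name"])

# An explicit copy is independent of the DataFrame
names = df["Name"].copy()
names.iloc[0] = "Changed"

original_first = df["Name"].iloc[0]  # value still held by the DataFrame
copy_first = names.iloc[0]           # value held by the modified copy
```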

In [54]:
nba = pd.read_csv("nba.csv") # reload the dataset to start from a fresh DataFrame
nba.head()

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0


In [55]:
nba.Team # attribute access to a single column; case-sensitive, and unusable for column names that contain spaces

0           Atlanta Hawks
1           Atlanta Hawks
2           Atlanta Hawks
3           Atlanta Hawks
4           Atlanta Hawks
              ...        
587    Washington Wizards
588    Washington Wizards
589    Washington Wizards
590    Washington Wizards
591                   NaN
Name: Team, Length: 592, dtype: object

In [56]:
nba["Team"] # bracket access to a single column; works for any column name, including names with spaces or special characters

0           Atlanta Hawks
1           Atlanta Hawks
2           Atlanta Hawks
3           Atlanta Hawks
4           Atlanta Hawks
              ...        
587    Washington Wizards
588    Washington Wizards
589    Washington Wizards
590    Washington Wizards
591                   NaN
Name: Team, Length: 592, dtype: object

In [57]:
names = nba["Name"].copy()  # create a copy of the "Name" column
names

0             Saddiq Bey
1      Bogdan Bogdanovic
2            Kobe Bufkin
3           Clint Capela
4         Bruno Fernando
             ...        
587         Ryan Rollins
588        Landry Shamet
589     Tristan Vukcevic
590         Delon Wright
591                  NaN
Name: Name, Length: 592, dtype: object

In [58]:
names.iloc[0] = "Whatever"  # modifies only the copy; no warning is raised and nba is unchanged

In [59]:
names.head()

0             Whatever
1    Bogdan Bogdanovic
2          Kobe Bufkin
3         Clint Capela
4       Bruno Fernando
Name: Name, dtype: object

In [60]:
nba.head()

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0


## Select Multiple Columns from a DataFrame
- Use square brackets with a list of names to extract multiple **DataFrame** columns.
- Pandas stores the result in a new **DataFrame** (a copy).

In [61]:
nba = pd.read_csv("nba.csv")
nba.head()

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0


In [66]:
# nba[["Name", "Team"]] # accessing multiple columns

columns_to_select = ["Name", "Team"] # list of columns to select
nba[columns_to_select] # accessing multiple columns using a list variable


Unnamed: 0,Name,Team
0,Saddiq Bey,Atlanta Hawks
1,Bogdan Bogdanovic,Atlanta Hawks
2,Kobe Bufkin,Atlanta Hawks
3,Clint Capela,Atlanta Hawks
4,Bruno Fernando,Atlanta Hawks
...,...,...
587,Ryan Rollins,Washington Wizards
588,Landry Shamet,Washington Wizards
589,Tristan Vukcevic,Washington Wizards
590,Delon Wright,Washington Wizards


## Add New Column to DataFrame
- Use square bracket extraction syntax with an equal sign to add a new **Series** to a **DataFrame**.
- The `insert` method allows us to insert a column at a specific column position.
- On the right-hand side, we can reference an existing **DataFrame** column and perform a broadcasting operation on it to create the new **Series**.
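
A small sketch of the three ideas above, using a hypothetical two-row frame instead of `nba.csv`. It also shows one detail worth knowing: by default `insert` refuses to add a column label that already exists.

```python
import pandas as pd

# Hypothetical miniature frame; not the nba.csv data
df = pd.DataFrame({"Name": ["Ann", "Bob"], "Salary": [10.0, 20.0]})

# Assignment appends the new column at the end (the scalar is broadcast to every row)
df["Sport"] = "Basketball"

# insert() places a column at a given position and modifies df in place
df.insert(loc=1, column="Salary Doubled", value=df["Salary"] * 2)

# Inserting a label that already exists raises ValueError by default
try:
    df.insert(loc=0, column="Sport", value="Basketball")
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

column_order = list(df.columns)
```

Unlike most **DataFrame** methods, `insert` returns `None` and changes the object it is called on, so there is nothing to assign back.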

In [73]:
nba = pd.read_csv("nba.csv")
nba.head()

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0


In [None]:
nba["Sport"] = "Basketball"  # adding a new column with a default value
nba.head()

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary,Sport
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983.0,Basketball
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0,Basketball
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0,Basketball
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0,Basketball
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0,Basketball


In [74]:
nba = pd.read_csv("nba.csv")  # reload first: insert raises ValueError if "Sport" already exists
nba.insert(loc=3, column="Sport", value="Basketball")  # adding a new column at a specific location
nba.head()

Unnamed: 0,Name,Team,Position,Sport,Height,Weight,College,Salary
0,Saddiq Bey,Atlanta Hawks,F,Basketball,6-7,215.0,Villanova,4556983.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,Basketball,6-5,225.0,Fenerbahce,18700000.0
2,Kobe Bufkin,Atlanta Hawks,G,Basketball,6-5,195.0,Michigan,4094244.0
3,Clint Capela,Atlanta Hawks,C,Basketball,6-10,256.0,Elan Chalon,20616000.0
4,Bruno Fernando,Atlanta Hawks,F-C,Basketball,6-10,240.0,Maryland,2581522.0


In [75]:
nba["Salary"] * 2 # multiplying a column by a scalar value

nba["Salary Doubled"] = nba["Salary"] * 2  # creating a new column based on calculations
nba.head()

Unnamed: 0,Name,Team,Position,Sport,Height,Weight,College,Salary,Salary Doubled
0,Saddiq Bey,Atlanta Hawks,F,Basketball,6-7,215.0,Villanova,4556983.0,9113966.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,Basketball,6-5,225.0,Fenerbahce,18700000.0,37400000.0
2,Kobe Bufkin,Atlanta Hawks,G,Basketball,6-5,195.0,Michigan,4094244.0,8188488.0
3,Clint Capela,Atlanta Hawks,C,Basketball,6-10,256.0,Elan Chalon,20616000.0,41232000.0
4,Bruno Fernando,Atlanta Hawks,F-C,Basketball,6-10,240.0,Maryland,2581522.0,5163044.0


## A Review of the value_counts Method
- The `value_counts` method counts the number of times that each unique value occurs in a **Series**.

In [76]:
nba = pd.read_csv("nba.csv")
nba.head()

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0


In [None]:
nba["Team"].value_counts()  # counts of unique values in the "Team" column

nba["Position"].value_counts()  # counts of unique values in the "Position" column

nba["Position"].value_counts(normalize=True) * 100 # percentage of each unique value in the "Position" column

Position
G      39.212329
F      32.020548
C       8.047945
G-F     7.876712
F-C     6.335616
C-F     3.938356
F-G     2.568493
Name: proportion, dtype: float64

## Drop Rows with Missing Values
- Pandas uses a `NaN` designation for cells that have a missing value.
- The `dropna` method deletes rows with missing values. Its default behavior is to remove a row if it has *any* missing values.
- Pass the `how` parameter an argument of "all" to delete rows where all the values are `NaN`.
- The `subset` parameter limits the columns that pandas checks when deciding which rows to drop.
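
To make the difference between the options concrete, here is a minimal sketch on a hypothetical four-row frame in which each row has a different pattern of missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 is complete, rows 1 and 2 are partly missing, row 3 is empty
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy", np.nan],
    "College": ["Utah", np.nan, "Duke", np.nan],
    "Salary": [10.0, 20.0, np.nan, np.nan],
})

any_missing = df.dropna()                     # keeps only fully populated rows
all_missing = df.dropna(how="all")            # drops only the completely empty row
college_only = df.dropna(subset=["College"])  # inspects the "College" column alone

kept = (len(any_missing), len(all_missing), len(college_only))
```

Each call returns a new **DataFrame** and keeps the original index labels of the surviving rows.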

In [81]:
nba = pd.read_csv("nba.csv")
nba

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0
...,...,...,...,...,...,...,...
587,Ryan Rollins,Washington Wizards,G,6-3,180.0,Toledo,1719864.0
588,Landry Shamet,Washington Wizards,G,6-4,190.0,Wichita State,10250000.0
589,Tristan Vukcevic,Washington Wizards,F,6-10,220.0,Real Madrid,
590,Delon Wright,Washington Wizards,G,6-5,185.0,Utah,8195122.0


In [88]:
nba.dropna() # drops rows with any missing values

nba.dropna(how="any") # drops rows with any missing values

nba.dropna(how="all") # drops rows with all missing values

nba.dropna(subset=["College"]) # drops rows with missing values in the "College" column

nba.dropna(subset=["College", "Salary"]) # drops rows with missing values in the "College" or "Salary" columns

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0
...,...,...,...,...,...,...,...
585,Eugene Omoruyi,Washington Wizards,F,6-6,235.0,Oregon,559782.0
586,Jordan Poole,Washington Wizards,G,6-4,194.0,Michigan,27955357.0
587,Ryan Rollins,Washington Wizards,G,6-3,180.0,Toledo,1719864.0
588,Landry Shamet,Washington Wizards,G,6-4,190.0,Wichita State,10250000.0


## Fill in Missing Values with the fillna Method
- The `fillna` method replaces missing `NaN` values with its argument.
- The `fillna` method is available on both **DataFrames** and **Series**.
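- In pandas versions without Copy-on-Write, an extracted **Series** can be a view on the original **DataFrame**. Either way, the `fillna` method returns a new **Series**, so assign the result back to the column to keep it.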
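
The sketch below, on a hypothetical two-row frame, shows that `fillna` leaves the original untouched until the result is assigned. It also shows the **DataFrame** form of `fillna` with a dictionary of per-column fill values, a variant these notes do not otherwise use.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; not the nba.csv data
df = pd.DataFrame({
    "College": ["Utah", np.nan],
    "Salary": [10.0, np.nan],
})

# Series.fillna returns a new Series; df is untouched until we assign the result back
filled_salary = df["Salary"].fillna(0)
still_missing = df["Salary"].isna().sum()

# DataFrame.fillna accepts a dict mapping column labels to fill values
filled = df.fillna({"College": "No College", "Salary": 0})
```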

In [91]:
nba = pd.read_csv("nba.csv").dropna(how = "all")  # drops rows with all missing values
nba

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0
...,...,...,...,...,...,...,...
586,Jordan Poole,Washington Wizards,G,6-4,194.0,Michigan,27955357.0
587,Ryan Rollins,Washington Wizards,G,6-3,180.0,Toledo,1719864.0
588,Landry Shamet,Washington Wizards,G,6-4,190.0,Wichita State,10250000.0
589,Tristan Vukcevic,Washington Wizards,F,6-10,220.0,Real Madrid,


In [94]:
nba["Salary"] = nba["Salary"].fillna(0)  # fillna returns a new Series; assign it back to replace the column
nba

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0
...,...,...,...,...,...,...,...
586,Jordan Poole,Washington Wizards,G,6-4,194.0,Michigan,27955357.0
587,Ryan Rollins,Washington Wizards,G,6-3,180.0,Toledo,1719864.0
588,Landry Shamet,Washington Wizards,G,6-4,190.0,Wichita State,10250000.0
589,Tristan Vukcevic,Washington Wizards,F,6-10,220.0,Real Madrid,0.0


In [95]:
nba["College"] = nba["College"].fillna("No College")  # fills missing values with "No College"
nba

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0
...,...,...,...,...,...,...,...
586,Jordan Poole,Washington Wizards,G,6-4,194.0,Michigan,27955357.0
587,Ryan Rollins,Washington Wizards,G,6-3,180.0,Toledo,1719864.0
588,Landry Shamet,Washington Wizards,G,6-4,190.0,Wichita State,10250000.0
589,Tristan Vukcevic,Washington Wizards,F,6-10,220.0,Real Madrid,0.0


## The astype Method I
- The `astype` method converts a **Series's** values to a specified type.
- Pass in the specified type as either a string or the core Python data type.
- Pandas cannot convert `NaN` values to integer types (`NaN` is itself a float), so we need to eliminate/replace them before we perform the conversion.
- The `dtypes` attribute returns a **Series** with the **DataFrame's** columns and their types.
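
A minimal sketch of the `NaN` restriction, using a hypothetical three-value **Series**. The last step uses pandas' nullable `"Int64"` extension type, which is an alternative not covered in these notes and is shown only for comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical salary values with one missing entry
salary = pd.Series([4556983.0, np.nan, 8195122.0])

# Casting to a NumPy integer type fails while NaN is present
try:
    salary.astype("int")
    cast_failed = False
except ValueError:
    cast_failed = True

# Replacing NaN first makes the cast possible
as_int = salary.fillna(0).astype("int")

# Alternative: the nullable "Int64" type keeps the missing value as <NA>
as_nullable = salary.astype("Int64")
```

Filling with `0` is convenient but changes the data: a missing salary and a salary of zero become indistinguishable afterwards.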

In [105]:
nba = pd.read_csv("nba.csv")
nba["Salary"] = nba["Salary"].fillna(0)
nba["Weight"] = nba["Weight"].fillna(0)
nba

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0
...,...,...,...,...,...,...,...
587,Ryan Rollins,Washington Wizards,G,6-3,180.0,Toledo,1719864.0
588,Landry Shamet,Washington Wizards,G,6-4,190.0,Wichita State,10250000.0
589,Tristan Vukcevic,Washington Wizards,F,6-10,220.0,Real Madrid,0.0
590,Delon Wright,Washington Wizards,G,6-5,185.0,Utah,8195122.0


In [106]:
nba.dtypes

Name         object
Team         object
Position     object
Height       object
Weight      float64
College      object
Salary      float64
dtype: object

In [107]:
nba["Salary"] = nba["Salary"].astype("int")  # converting the "Salary" column to integer type
nba

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522
...,...,...,...,...,...,...,...
587,Ryan Rollins,Washington Wizards,G,6-3,180.0,Toledo,1719864
588,Landry Shamet,Washington Wizards,G,6-4,190.0,Wichita State,10250000
589,Tristan Vukcevic,Washington Wizards,F,6-10,220.0,Real Madrid,0
590,Delon Wright,Washington Wizards,G,6-5,185.0,Utah,8195122


In [109]:
nba["Weight"] = nba["Weight"].astype("int")  # converting the "Weight" column to integer type
nba

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,Atlanta Hawks,F,6-7,215,Villanova,4556983
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225,Fenerbahce,18700000
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195,Michigan,4094244
3,Clint Capela,Atlanta Hawks,C,6-10,256,Elan Chalon,20616000
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240,Maryland,2581522
...,...,...,...,...,...,...,...
587,Ryan Rollins,Washington Wizards,G,6-3,180,Toledo,1719864
588,Landry Shamet,Washington Wizards,G,6-4,190,Wichita State,10250000
589,Tristan Vukcevic,Washington Wizards,F,6-10,220,Real Madrid,0
590,Delon Wright,Washington Wizards,G,6-5,185,Utah,8195122


In [110]:
nba.dtypes

Name        object
Team        object
Position    object
Height      object
Weight       int64
College     object
Salary       int64
dtype: object

## The astype Method II
- The `category` type is ideal for columns with a limited number of unique values.
- On a **DataFrame**, the `nunique` method returns a **Series** with the number of unique values in each column. On a **Series**, it returns a single integer.
- With categories, pandas does not create a separate value in memory for each "cell". Rather, the cells point to a single copy for each unique value.
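
The memory effect can be checked directly with `memory_usage`. The sketch below uses a hypothetical column of three team names repeated many times; the exact byte counts depend on the pandas version and string storage, so only the comparison is meaningful.

```python
import pandas as pd

# Hypothetical column: 3 unique team names repeated 1,000 times each
teams = pd.Series(["Atlanta Hawks", "Boston Celtics", "Utah Jazz"] * 1000)

as_category = teams.astype("category")

# deep=True asks pandas to measure the string payload as well as the container
bytes_before = teams.memory_usage(deep=True)
bytes_after = as_category.memory_usage(deep=True)

unique_count = teams.nunique()
categories = list(as_category.cat.categories)
```

The saving shrinks as the share of unique values grows, which is why a column such as `Name` is a poor candidate for the `category` type.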

In [111]:
nba = pd.read_csv("nba.csv")
nba.head()

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0


In [112]:
nba

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0
...,...,...,...,...,...,...,...
587,Ryan Rollins,Washington Wizards,G,6-3,180.0,Toledo,1719864.0
588,Landry Shamet,Washington Wizards,G,6-4,190.0,Wichita State,10250000.0
589,Tristan Vukcevic,Washington Wizards,F,6-10,220.0,Real Madrid,
590,Delon Wright,Washington Wizards,G,6-5,185.0,Utah,8195122.0


In [None]:
nba["Team"].nunique() # number of unique teams in the "Team" column
nba.nunique() # number of unique values in each column

Name        591
Team         30
Position      7
Height       20
Weight       93
College     182
Salary      298
dtype: int64

In [116]:
nba.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 592 entries, 0 to 591
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      591 non-null    object 
 1   Team      591 non-null    object 
 2   Position  584 non-null    object 
 3   Height    585 non-null    object 
 4   Weight    584 non-null    float64
 5   College   578 non-null    object 
 6   Salary    488 non-null    float64
dtypes: float64(2), object(5)
memory usage: 32.5+ KB


In [119]:
nba["Position"] = nba["Position"].astype("category")  # converting the "Position" column to categorical type
nba

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0
...,...,...,...,...,...,...,...
587,Ryan Rollins,Washington Wizards,G,6-3,180.0,Toledo,1719864.0
588,Landry Shamet,Washington Wizards,G,6-4,190.0,Wichita State,10250000.0
589,Tristan Vukcevic,Washington Wizards,F,6-10,220.0,Real Madrid,
590,Delon Wright,Washington Wizards,G,6-5,185.0,Utah,8195122.0


In [122]:
nba["Team"] = nba["Team"].astype("category") # converting the "Team" column to categorical type

In [123]:
nba.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 592 entries, 0 to 591
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   Name      591 non-null    object  
 1   Team      591 non-null    category
 2   Position  584 non-null    category
 3   Height    585 non-null    object  
 4   Weight    584 non-null    float64 
 5   College   578 non-null    object  
 6   Salary    488 non-null    float64 
dtypes: category(2), float64(2), object(3)
memory usage: 26.0+ KB

## Sort a DataFrame with the sort_values Method I
- The `sort_values` method sorts a **DataFrame** by the values in one or more columns. The default sort is an ascending one (alphabetical for strings).
- The first parameter (`by`) expects the column(s) to sort by.
- If sorting by a single column, pass a string with its name.
- The `ascending` parameter customizes the sort order.
- The `na_position` parameter customizes where pandas places `NaN` values.
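
A short sketch of how `ascending` and `na_position` interact, on a hypothetical three-row frame. Note that `na_position` is independent of the sort direction: `NaN` values go last by default whether the sort is ascending or descending.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with one missing salary
df = pd.DataFrame({"Name": ["A", "B", "C"], "Salary": [30.0, np.nan, 10.0]})

# Descending sort; na_position defaults to "last"
desc_nan_last = df.sort_values("Salary", ascending=False)

# Same sort, but missing values are moved to the top
desc_nan_first = df.sort_values("Salary", ascending=False, na_position="first")
```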

In [124]:
nba = pd.read_csv("nba.csv")
nba.head()

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0


In [None]:
nba.sort_values(by="Name", ascending=False) # sorting the DataFrame by the "Name" column in descending order

nba.sort_values("Salary", ascending=False, na_position='last') # sorting the DataFrame by the "Salary" column in descending order, with null values at the end

nba.sort_values("Salary", ascending=False, na_position='first') # sorting the DataFrame by the "Salary" column in descending order, with null values at the beginning

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
23,Blake Griffin,Boston Celtics,F,6-9,250.0,Oklahoma,
26,Mfiondu Kabengele,Boston Celtics,C,6-10,250.0,Florida State,
28,Svi Mykhailiuk,Boston Celtics,G-F,6-7,205.0,Kansas,
35,Robert Williams III,Boston Celtics,C-F,6-9,237.0,Texas A&M,
39,Nic Claxton,Brooklyn Nets,C,6-11,215.0,Georgia,
...,...,...,...,...,...,...,...
336,Lindell Wigginton,Milwaukee Bucks,G,6-1,189.0,Iowa State,559782.0
143,Jay Huff,Denver Nuggets,C,7-1,240.0,Virginia,559782.0
244,Jordan Miller,Los Angeles Clippers,G,6-7,194.0,Miami,559782.0
147,Braxton Key,Denver Nuggets,F,6-8,225.0,Virginia,559782.0


## Sort a DataFrame with the sort_values Method II
- To sort by multiple columns, pass the `by` parameter a list of column names. Pandas will sort in the specified column order (first to last).
- Pass the `ascending` parameter a Boolean to sort all columns in a consistent order (all ascending or all descending).
- Pass `ascending` a list to customize the sort order *per* column. The `ascending` list length must match the `by` list.
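
The two forms of `ascending` can be compared on a small hypothetical frame with two teams and two players each.

```python
import pandas as pd

# Hypothetical miniature frame; not the nba.csv data
df = pd.DataFrame({
    "Team": ["Hawks", "Celtics", "Hawks", "Celtics"],
    "Name": ["Bey", "Tatum", "Young", "Brown"],
})

# Team ascending (A-Z), then Name descending (Z-A) within each team
mixed = df.sort_values(["Team", "Name"], ascending=[True, False])

# A single Boolean applies to every column in `by`
all_desc = df.sort_values(["Team", "Name"], ascending=False)
```

The second column only breaks ties in the first, so `Name` order matters solely among rows that share a `Team`.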

In [133]:
nba = pd.read_csv("nba.csv")
nba.head()

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0


In [None]:
nba.sort_values(["Team", "Name"], ascending=False) # sorting the DataFrame by the "Team" column and then by the "Name" column, both in descending order

nba.sort_values(["Team", "Name"], ascending=[True, False]) # sorting the DataFrame by the "Team" column in ascending order and then by the "Name" column in descending order

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
12,Wesley Matthews,Atlanta Hawks,G,6-5,220.0,Marquette,3196448.0
5,Trent Forrest,Atlanta Hawks,G,6-4,210.0,Florida State,508891.0
17,Trae Young,Atlanta Hawks,G,6-1,164.0,Oklahoma,40064220.0
10,Seth Lundy,Atlanta Hawks,G,6-6,220.0,Penn State,559782.0
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983.0
...,...,...,...,...,...,...,...
576,Daniel Gafford,Washington Wizards,F-C,6-10,234.0,Arkansas,12402000.0
581,Corey Kispert,Washington Wizards,F,6-6,224.0,Gonzaga,3722040.0
574,Bilal Coulibaly,Washington Wizards,G,6-6,195.0,Metropolitans 92,6614256.0
579,Anthony Gill,Washington Wizards,F,6-8,230.0,Virginia,1997238.0


## Sort a DataFrame by its Index
- The `sort_index` method sorts the **DataFrame** by its index positions/labels.
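
For sorting by index, the `sort_index` method can be sketched on a hypothetical frame whose index labels are deliberately out of order.

```python
import pandas as pd

# Hypothetical frame with shuffled integer index labels
df = pd.DataFrame(
    {"Name": ["Young", "Bey", "Capela"]},
    index=[17, 0, 3],
)

by_label = df.sort_index()                      # ascending label order
by_label_desc = df.sort_index(ascending=False)  # descending label order

# After sort_values reorders the rows, sort_index puts them back in label order
round_trip = df.sort_values("Name").sort_index()
```

This is a handy way to undo a `sort_values` call when the index still carries the original row order.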


## Rank Values with the rank Method
- The `rank` method assigns a numeric ranking to each **Series** value.
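- Pandas assigns the same rank to equal values. By default (`method="average"`), tied values share the average of the ranks they would otherwise occupy, which leaves a "gap" in the sequence of ranks.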
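
Tie handling is easiest to see on a tiny example. The sketch below ranks a hypothetical **Series** containing a three-way tie and compares the default `method="average"` with two other values of the `method` parameter, `"min"` and `"dense"`.

```python
import pandas as pd

# Hypothetical salaries with a three-way tie at 40
salary = pd.Series([50, 40, 40, 40, 10])

# Default method="average": the tied values would occupy ranks 2, 3 and 4,
# so each one receives their mean
average = salary.rank(ascending=False)

# method="min" gives every tied value the lowest rank of the group
minimum = salary.rank(ascending=False, method="min")

# method="dense" leaves no gap after a tie
dense = salary.rank(ascending=False, method="dense")
```

Because `rank` returns floats, averaged ranks such as `2.5` are possible; converting the result with `astype(int)` truncates them.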

In [140]:
nba = pd.read_csv("nba.csv").dropna(how = "all")  # drops rows with all missing values
nba

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983.0
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000.0
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244.0
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000.0
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522.0
...,...,...,...,...,...,...,...
586,Jordan Poole,Washington Wizards,G,6-4,194.0,Michigan,27955357.0
587,Ryan Rollins,Washington Wizards,G,6-3,180.0,Toledo,1719864.0
588,Landry Shamet,Washington Wizards,G,6-4,190.0,Wichita State,10250000.0
589,Tristan Vukcevic,Washington Wizards,F,6-10,220.0,Real Madrid,


In [None]:
nba["Salary"] = nba["Salary"].fillna(0).astype(int) # replacing missing salaries with 0 and converting the "Salary" column to integers
nba

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary
0,Saddiq Bey,Atlanta Hawks,F,6-7,215.0,Villanova,4556983
1,Bogdan Bogdanovic,Atlanta Hawks,G,6-5,225.0,Fenerbahce,18700000
2,Kobe Bufkin,Atlanta Hawks,G,6-5,195.0,Michigan,4094244
3,Clint Capela,Atlanta Hawks,C,6-10,256.0,Elan Chalon,20616000
4,Bruno Fernando,Atlanta Hawks,F-C,6-10,240.0,Maryland,2581522
...,...,...,...,...,...,...,...
586,Jordan Poole,Washington Wizards,G,6-4,194.0,Michigan,27955357
587,Ryan Rollins,Washington Wizards,G,6-3,180.0,Toledo,1719864
588,Landry Shamet,Washington Wizards,G,6-4,190.0,Wichita State,10250000
589,Tristan Vukcevic,Washington Wizards,F,6-10,220.0,Real Madrid,0


In [None]:
nba["Salary Rank"] = nba["Salary"].rank(ascending=False).astype(int) # creating a new column "Salary Rank" from highest to lowest salary; astype(int) truncates any fractional average ranks produced by ties
nba.sort_values("Salary", ascending=False) # displaying the DataFrame sorted by "Salary" in descending order

Unnamed: 0,Name,Team,Position,Height,Weight,College,Salary,Salary Rank
175,Stephen Curry,Golden State Warriors,G,6-2,185.0,Davidson,51915615,1
461,Kevin Durant,Phoenix Suns,F,6-10,240.0,Texas,47649433,2
261,LeBron James,Los Angeles Lakers,F,6-9,250.0,St. Vincent-St. Mary HS (OH),47607350,4
145,Nikola Jokic,Denver Nuggets,C,6-11,284.0,Mega Basket,47607350,4
436,Joel Embiid,Philadelphia 76ers,C-F,7-0,280.0,Kansas,47607350,4
...,...,...,...,...,...,...,...,...
64,James Nnaji,Charlotte Hornets,F,6-11,250.0,FC Barcelona,0,540
132,Christian Wood,Dallas Mavericks,F,6-9,214.0,UNLV,0,540
126,Theo Pinson,Dallas Mavericks,G-F,6-7,212.0,North Carolina,0,540
125,Markieff Morris,Dallas Mavericks,F,6-9,245.0,Kansas,0,540
