<a href="https://colab.research.google.com/github/krauseannelize/nb-py-ms-exercises/blob/sprint03/notebooks/s03_pandas_foundation/33_working_with_dataframes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 33 | Working With DataFrames

## Intro to DataFrame Operations

Sorting and modifying data are essential for organizing and cleaning datasets. These operations help prepare data for analysis and visualization.

## Importing `Pandas` & `NumPy`

In [3]:
import pandas as pd
import numpy as np

## Preparing for Random Data Generation

Before creating a sample DataFrame, we set up our environment to generate random numbers in a controlled way:

- We import `randn` from `numpy.random`; it draws values from a standard normal distribution, which we use to populate the DataFrame.
- By using `np.random.seed(101)`, we ensure that the random numbers generated are reproducible. This means every time the code runs, it produces the same output.

In [4]:
from numpy.random import randn
np.random.seed(101)
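
As a quick check of the reproducibility claim above, the following minimal sketch re-seeds the generator twice and compares the results. The variable names `first_draw`, `second_draw`, and `draws_match` are illustrative and not part of the original notebook.

```python
import numpy as np
from numpy.random import randn

# Seeding resets the global generator to a known state
np.random.seed(101)
first_draw = randn(3)

# Re-seeding with the same value restarts from that same state
np.random.seed(101)
second_draw = randn(3)

# Both draws hold identical values because the seed was the same
draws_match = np.array_equal(first_draw, second_draw)
print(draws_match)  # True
```

Note that the seed has to be set again before each run of the draw; calling `randn` a second time without re-seeding continues the sequence and returns different numbers.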

## Creating a Sample DataFrame with Random Values

- `randn(5, 4)` generates a 5-by-4 array of random numbers: 5 rows and 4 columns.
- **Rows** are labeled 'A', 'B', 'C', 'D' and 'E'.
- **Columns** are labeled 'W', 'X', 'Y', and 'Z'.

In [5]:
df = pd.DataFrame(randn(5,4), index='A B C D E'.split(), columns='W X Y Z'.split())
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509
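
To confirm that the labels and dimensions match the description above, one option is to inspect `shape`, `index`, and `columns`. This is a small sketch that rebuilds an equivalent frame under the name `sample` (a hypothetical name, chosen so the notebook's `df` is not overwritten).

```python
import numpy as np
import pandas as pd

np.random.seed(101)
sample = pd.DataFrame(
    np.random.randn(5, 4),
    index=list("ABCDE"),    # row labels
    columns=list("WXYZ"),   # column labels
)

# shape is a (rows, columns) tuple
n_rows, n_cols = sample.shape
row_labels = sample.index.tolist()
col_labels = sample.columns.tolist()

print(n_rows, n_cols)  # 5 4
print(row_labels)      # ['A', 'B', 'C', 'D', 'E']
print(col_labels)      # ['W', 'X', 'Y', 'Z']
```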


## Sorting Data

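Sorting organizes data for better readability and analysis. `sort_index()` sorts by row or column labels, while `sort_values()` sorts by the values in one or more columns. By using `sort_values()`, we can: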

- sort a DataFrame by one or more columns.
- specify `ascending=True` for ascending order (default) or `ascending=False` for descending order.

In [6]:
# sort the DataFrame by its row index (already in order here, so the output is unchanged)
df.sort_index()

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509
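
Because the index of `df` is already in alphabetical order, the output above looks identical to the original. The effect of `sort_index()` is easier to see on a frame whose labels are out of order; the sketch below uses a hypothetical three-row frame named `shuffled`.

```python
import pandas as pd

# Row labels are deliberately out of order
shuffled = pd.DataFrame({"value": [30, 10, 20]}, index=["C", "A", "B"])

# Sort rows by label, ascending (the default) and descending
by_label = shuffled.sort_index()
by_label_desc = shuffled.sort_index(ascending=False)

print(by_label.index.tolist())       # ['A', 'B', 'C']
print(by_label_desc.index.tolist())  # ['C', 'B', 'A']
print(by_label["value"].tolist())    # [10, 20, 30]
```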


In [7]:
# sort the DataFrame by column 'X' in descending order
df.sort_values(by='X', ascending=False)

Unnamed: 0,W,X,Y,Z
E,0.190794,1.978757,2.605967,0.683509
C,-2.018168,0.740122,0.528813,-0.589001
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
D,0.188695,-0.758872,-0.933237,0.955057


In [8]:
# use .sort_index with axis=1 to sort columns by label (here in descending order) instead of rows
df.sort_index(axis=1, ascending=False)

Unnamed: 0,Z,Y,X,W
A,0.503826,0.907969,0.628133,2.70685
B,0.605965,-0.848077,-0.319318,0.651118
C,-0.589001,0.528813,0.740122,-2.018168
D,0.955057,-0.933237,-0.758872,0.188695
E,0.683509,2.605967,1.978757,0.190794


In [9]:
# sort the DataFrame by multiple columns in different orders
df.sort_values(by=['X', 'Y'], ascending=[True, False])

Unnamed: 0,W,X,Y,Z
D,0.188695,-0.758872,-0.933237,0.955057
B,0.651118,-0.319318,-0.848077,0.605965
A,2.70685,0.628133,0.907969,0.503826
C,-2.018168,0.740122,0.528813,-0.589001
E,0.190794,1.978757,2.605967,0.683509
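
In the output above, every value in `X` is distinct, so the second sort key `Y` never comes into play. The second key only matters when the first key has ties. The following sketch illustrates this with a hypothetical frame named `scores`, and also shows that `sort_values()` returns a new DataFrame rather than reordering the original.

```python
import pandas as pd

scores = pd.DataFrame(
    {"team": ["red", "blue", "red", "blue"], "points": [5, 7, 9, 3]},
    index=["a", "b", "c", "d"],
)

# Sort by team (ascending); within each team, by points (descending)
ranked = scores.sort_values(by=["team", "points"], ascending=[True, False])

print(ranked.index.tolist())  # ['b', 'd', 'c', 'a']
print(scores.index.tolist())  # ['a', 'b', 'c', 'd'] (original order kept)
```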


## Arithmetic Operations

Perform arithmetic operations directly on DataFrame columns to generate new insights or adjust existing values. You can also aggregate data using built-in methods that summarize values across rows or columns.

Common aggregation methods include:

- `sum()`: Calculates the total of each column or each row.
- `mean()`: Computes the average value of each column or each row.
- `max()`: Returns the highest value in each column or each row.
- `min()`: Returns the lowest value in each column or each row.
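
The averaging and minimum methods listed above can be sketched on a small frame with easy-to-check numbers. The frame `toy` and the result names are illustrative only; `axis=1` switches the calculation from per-column to per-row.

```python
import pandas as pd

toy = pd.DataFrame(
    {"W": [1.0, 2.0, 3.0], "X": [4.0, 6.0, 8.0]},
    index=["A", "B", "C"],
)

# Default axis=0: one result per column
col_means = toy.mean()

# axis=1: one result per row
row_mins = toy.min(axis=1)

print(col_means.tolist())  # [2.0, 6.0]
print(row_mins.tolist())   # [1.0, 2.0, 3.0]
```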

### Performing basic operations

In [10]:
# view current DataFrame
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [11]:
# subtract 1 from all values in column 'X'
df['X'] = df['X'] - 1
df

Unnamed: 0,W,X,Y,Z
A,2.70685,-0.371867,0.907969,0.503826
B,0.651118,-1.319318,-0.848077,0.605965
C,-2.018168,-0.259878,0.528813,-0.589001
D,0.188695,-1.758872,-0.933237,0.955057
E,0.190794,0.978757,2.605967,0.683509


In [12]:
# multiply all values in column 'Y' by 2
df['Y'] = df['Y'] * 2
df

Unnamed: 0,W,X,Y,Z
A,2.70685,-0.371867,1.815939,0.503826
B,0.651118,-1.319318,-1.696154,0.605965
C,-2.018168,-0.259878,1.057627,-0.589001
D,0.188695,-1.758872,-1.866474,0.955057
E,0.190794,0.978757,5.211935,0.683509
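
The cells above overwrite an existing column. Arithmetic can also combine columns and store the result under a new name, which leaves the inputs intact. A minimal sketch, assuming a hypothetical `orders` frame with `price` and `quantity` columns:

```python
import pandas as pd

orders = pd.DataFrame({"price": [2.5, 4.0, 1.0], "quantity": [2, 1, 10]})

# Element-wise product of two columns, stored as a new column
orders["total"] = orders["price"] * orders["quantity"]

# A scalar is applied to every row of the column
orders["price_with_fee"] = orders["price"] + 0.5

print(orders["total"].tolist())           # [5.0, 4.0, 10.0]
print(orders["price_with_fee"].tolist())  # [3.0, 4.5, 1.5]
```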


### Aggregating values

In [13]:
# sum of each column (adds down the rows)
df.sum()

Unnamed: 0,0
W,1.719289
X,-2.731178
Y,4.522872
Z,2.159356


In [14]:
# sum of each row (adds across the columns)
df.sum(axis=1)

Unnamed: 0,0
A,4.654747
B,-1.758389
C,-1.80942
D,-2.481595
E,7.064995


In [15]:
# highest value in each column
df.max()

Unnamed: 0,0
W,2.70685
X,0.978757
Y,5.211935
Z,0.955057


In [16]:
# highest value in each row
df.max(axis=1)

Unnamed: 0,0
A,2.70685
B,0.651118
C,1.057627
D,0.955057
E,5.211935
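
When several aggregates are needed at once, `DataFrame.agg()` accepts a list of function names and returns one row per function. This is offered as an optional shortcut, not something the cells above rely on; the frame `toy` is a hypothetical example.

```python
import pandas as pd

toy = pd.DataFrame({"W": [1, 5, 3], "X": [4, 2, 6]})

# One row per aggregation, one column per original column
summary = toy.agg(["sum", "max", "min"])

print(summary.loc["sum", "W"])  # 9
print(summary.loc["max", "X"])  # 6
print(summary.loc["min", "X"])  # 2
```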


## Info on Unique Values

In [17]:
# return unique values in column 'Z'
df['Z'].unique()

array([ 0.50382575,  0.60596535, -0.58900053,  0.95505651,  0.68350889])

In [18]:
# count how many unique values are in column 'Z'
df['Z'].nunique()

5

In [19]:
# count how many times each value appears in column 'Z'
df['Z'].value_counts()

Z,count
0.503826,1
0.605965,1
-0.589001,1
0.955057,1
0.683509,1
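
Every random float in column `Z` is distinct, so each count above is 1. These methods are more informative on a column with repeated values. The sketch below uses a hypothetical Series of state codes to show the difference between the three calls.

```python
import pandas as pd

states = pd.Series(["CA", "NY", "CA", "OR", "CA", "NY"])

distinct = states.unique().tolist()  # distinct values, in order of first appearance
n_distinct = states.nunique()        # how many distinct values there are
counts = states.value_counts()       # occurrences per value, most frequent first

print(distinct)               # ['CA', 'NY', 'OR']
print(n_distinct)             # 3
print(counts.index.tolist())  # ['CA', 'NY', 'OR']
print(counts.tolist())        # [3, 2, 1]
```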


## Adding and Removing Columns

- Add a column by assigning values to a new column name: `df["new_column"] = values`
- Remove a column with `drop()`, which returns a new DataFrame and leaves the original unchanged: `df.drop(columns=["column_name"])`
- Permanently remove a column with the `del` statement: `del df["column_name"]`

In [21]:
# add a new 'States' column
df['States'] = 'CA NY WY OR CO'.split()
df

Unnamed: 0,W,X,Y,Z,States
A,2.70685,-0.371867,1.815939,0.503826,CA
B,0.651118,-1.319318,-1.696154,0.605965,NY
C,-2.018168,-0.259878,1.057627,-0.589001,WY
D,0.188695,-1.758872,-1.866474,0.955057,OR
E,0.190794,0.978757,5.211935,0.683509,CO


In [22]:
# drop a column in the returned copy only (df itself is not modified)
df.drop(columns=['States'])

Unnamed: 0,W,X,Y,Z
A,2.70685,-0.371867,1.815939,0.503826
B,0.651118,-1.319318,-1.696154,0.605965
C,-2.018168,-0.259878,1.057627,-0.589001
D,0.188695,-1.758872,-1.866474,0.955057
E,0.190794,0.978757,5.211935,0.683509


In [23]:
# DataFrame remains unchanged
df

Unnamed: 0,W,X,Y,Z,States
A,2.70685,-0.371867,1.815939,0.503826,CA
B,0.651118,-1.319318,-1.696154,0.605965,NY
C,-2.018168,-0.259878,1.057627,-0.589001,WY
D,0.188695,-1.758872,-1.866474,0.955057,OR
E,0.190794,0.978757,5.211935,0.683509,CO


In [24]:
# permanently delete a column
del df['States']
df

Unnamed: 0,W,X,Y,Z
A,2.70685,-0.371867,1.815939,0.503826
B,0.651118,-1.319318,-1.696154,0.605965
C,-2.018168,-0.259878,1.057627,-0.589001
D,0.188695,-1.758872,-1.866474,0.955057
E,0.190794,0.978757,5.211935,0.683509
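
Besides `del`, two other common ways to make a removal stick are reassigning the result of `drop()` and calling `pop()`, which removes a column in place and hands it back as a Series. A short sketch on a hypothetical frame named `frame`:

```python
import pandas as pd

frame = pd.DataFrame({"W": [1, 2], "X": [3, 4], "States": ["CA", "NY"]})

# drop() returns a new DataFrame; keep it by assigning it to a name
trimmed = frame.drop(columns=["States"])

# pop() removes the column from frame itself and returns its values
popped = frame.pop("X")

print(list(trimmed.columns))  # ['W', 'X']
print(list(frame.columns))    # ['W', 'States']
print(popped.tolist())        # [3, 4]
```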


## Renaming Columns

Use `.rename()` to make column names more descriptive or consistent. It returns a new DataFrame, so assign the result back to keep the change.

In [31]:
# rename column 'X' to 'Price' and 'Y' to 'Sales'
df = df.rename(columns={'X': 'Price', 'Y': 'Sales'})
df

Unnamed: 0,W,Price,Sales,Z
A,2.70685,-0.371867,1.815939,0.503826
B,0.651118,-1.319318,-1.696154,0.605965
C,-2.018168,-0.259878,1.057627,-0.589001
D,0.188695,-1.758872,-1.866474,0.955057
E,0.190794,0.978757,5.211935,0.683509
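
`rename()` also accepts a function instead of a dictionary, which can help when many columns need the same cleanup, and it can relabel rows through the `index` argument. A minimal sketch, assuming a hypothetical frame whose column names contain spaces:

```python
import pandas as pd

frame = pd.DataFrame(
    {"Order ID": [1, 2], "Ship Mode": ["First", "Second"]},
    index=["A", "B"],
)

# Apply one rule to every column name; relabel a single row by mapping
renamed = frame.rename(
    columns=lambda name: name.lower().replace(" ", "_"),
    index={"A": "row_a"},
)

print(list(renamed.columns))   # ['order_id', 'ship_mode']
print(renamed.index.tolist())  # ['row_a', 'B']
print(list(frame.columns))     # ['Order ID', 'Ship Mode'] (original unchanged)
```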


## Summary Practice Exercise

### Reading Excel files

To load a file from your local computer into Colab, run the following cell and select the file manually.

In [25]:
from google.colab import files
uploaded = files.upload()

Saving sample-superstore.xls to sample-superstore.xls


In [26]:
# import excel data into a DataFrame and view
df_super = pd.read_excel('sample-superstore.xls', sheet_name=0)
df_super

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country/Region,City,...,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
0,1,US-2020-103800,2020-01-03,2020-01-07,Standard Class,DP-13000,Darren Powers,Consumer,United States,Houston,...,77095,Central,OFF-PA-10000174,Office Supplies,Paper,"Message Book, Wirebound, Four 5 1/2"" X 4"" Form...",16.448,2,0.2,5.5512
1,2,US-2020-112326,2020-01-04,2020-01-08,Standard Class,PO-19195,Phillina Ober,Home Office,United States,Naperville,...,60540,Central,OFF-BI-10004094,Office Supplies,Binders,GBC Standard Plastic Binding Systems Combs,3.540,2,0.8,-5.4870
2,3,US-2020-112326,2020-01-04,2020-01-08,Standard Class,PO-19195,Phillina Ober,Home Office,United States,Naperville,...,60540,Central,OFF-LA-10003223,Office Supplies,Labels,Avery 508,11.784,3,0.2,4.2717
3,4,US-2020-112326,2020-01-04,2020-01-08,Standard Class,PO-19195,Phillina Ober,Home Office,United States,Naperville,...,60540,Central,OFF-ST-10002743,Office Supplies,Storage,SAFCO Boltless Steel Shelving,272.736,3,0.2,-64.7748
4,5,US-2020-141817,2020-01-05,2020-01-12,Standard Class,MB-18085,Mick Brown,Consumer,United States,Philadelphia,...,19143,East,OFF-AR-10003478,Office Supplies,Art,Avery Hi-Liter EverBold Pen Style Fluorescent ...,19.536,3,0.2,4.8840
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10189,10190,US-2023-143259,2023-12-30,2024-01-03,Standard Class,PO-18865,Patrick O'Donnell,Consumer,United States,New York City,...,10009,East,OFF-BI-10003684,Office Supplies,Binders,Wilson Jones Legal Size Ring Binders,52.776,3,0.2,19.7910
10190,10191,US-2023-115427,2023-12-30,2024-01-03,Standard Class,EB-13975,Erica Bern,Corporate,United States,Fairfield,...,94533,West,OFF-BI-10004632,Office Supplies,Binders,GBC Binding covers,20.720,2,0.2,6.4750
10191,10192,US-2023-156720,2023-12-30,2024-01-03,Standard Class,JM-15580,Jill Matthias,Consumer,United States,Loveland,...,80538,West,OFF-FA-10003472,Office Supplies,Fasteners,Bagged Rubber Bands,3.024,3,0.2,-0.6048
10192,10193,US-2023-143259,2023-12-30,2024-01-03,Standard Class,PO-18865,Patrick O'Donnell,Consumer,United States,New York City,...,10009,East,TEC-PH-10004774,Technology,Phones,Gear Head AU3700S Headset,90.930,7,0.0,2.7279


### Getting info

In [27]:
# view the first 5 rows
df_super.head()

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country/Region,City,...,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
0,1,US-2020-103800,2020-01-03,2020-01-07,Standard Class,DP-13000,Darren Powers,Consumer,United States,Houston,...,77095,Central,OFF-PA-10000174,Office Supplies,Paper,"Message Book, Wirebound, Four 5 1/2"" X 4"" Form...",16.448,2,0.2,5.5512
1,2,US-2020-112326,2020-01-04,2020-01-08,Standard Class,PO-19195,Phillina Ober,Home Office,United States,Naperville,...,60540,Central,OFF-BI-10004094,Office Supplies,Binders,GBC Standard Plastic Binding Systems Combs,3.54,2,0.8,-5.487
2,3,US-2020-112326,2020-01-04,2020-01-08,Standard Class,PO-19195,Phillina Ober,Home Office,United States,Naperville,...,60540,Central,OFF-LA-10003223,Office Supplies,Labels,Avery 508,11.784,3,0.2,4.2717
3,4,US-2020-112326,2020-01-04,2020-01-08,Standard Class,PO-19195,Phillina Ober,Home Office,United States,Naperville,...,60540,Central,OFF-ST-10002743,Office Supplies,Storage,SAFCO Boltless Steel Shelving,272.736,3,0.2,-64.7748
4,5,US-2020-141817,2020-01-05,2020-01-12,Standard Class,MB-18085,Mick Brown,Consumer,United States,Philadelphia,...,19143,East,OFF-AR-10003478,Office Supplies,Art,Avery Hi-Liter EverBold Pen Style Fluorescent ...,19.536,3,0.2,4.884


In [28]:
# view the last 2 rows
df_super.tail(2)

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country/Region,City,...,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
10192,10193,US-2023-143259,2023-12-30,2024-01-03,Standard Class,PO-18865,Patrick O'Donnell,Consumer,United States,New York City,...,10009,East,TEC-PH-10004774,Technology,Phones,Gear Head AU3700S Headset,90.93,7,0.0,2.7279
10193,10194,CA-2023-143500,2023-12-30,2024-01-03,Standard Class,HO-15230,Harry Olson,Consumer,Canada,Charlottetown,...,C0A,East,OFF-BI-10004040,Office Supplies,Binders,Wilson Jones Impact Binders,3.024,3,0.2,-0.6048


In [29]:
# view the shape of the DataFrame
df_super.shape

(10194, 21)

In [33]:
# view the default RangeIndex used to label rows
df_super.index

RangeIndex(start=0, stop=10194, step=1)
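
The `shape` tuple and the default `RangeIndex` are related: the number of rows equals the index's `stop` value and `len()` of the frame. The sketch below shows this on a tiny stand-in frame (`stand_in` is hypothetical, not the superstore data).

```python
import pandas as pd

stand_in = pd.DataFrame(
    {"Row ID": [1, 2, 3], "Region": ["East", "West", "East"]}
)

# shape unpacks into (rows, columns)
n_rows, n_cols = stand_in.shape

print(n_rows, n_cols)        # 3 2
print(len(stand_in))         # 3
print(stand_in.index.stop)   # 3
print(stand_in.tail(1)["Row ID"].tolist())  # [3]
```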

In [34]:
# view a summary of the DataFrame including column types, non-null counts, and memory usage
df_super.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10194 entries, 0 to 10193
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   Row ID          10194 non-null  int64         
 1   Order ID        10194 non-null  object        
 2   Order Date      10194 non-null  datetime64[ns]
 3   Ship Date       10194 non-null  datetime64[ns]
 4   Ship Mode       10194 non-null  object        
 5   Customer ID     10194 non-null  object        
 6   Customer Name   10194 non-null  object        
 7   Segment         10194 non-null  object        
 8   Country/Region  10194 non-null  object        
 9   City            10194 non-null  object        
 10  State/Province  10194 non-null  object        
 11  Postal Code     10194 non-null  object        
 12  Region          10194 non-null  object        
 13  Product ID      10194 non-null  object        
 14  Category        10194 non-null  object        
 15  Su

In [35]:
# view datatypes of the DataFrame
df_super.dtypes

Unnamed: 0,0
Row ID,int64
Order ID,object
Order Date,datetime64[ns]
Ship Date,datetime64[ns]
Ship Mode,object
Customer ID,object
Customer Name,object
Segment,object
Country/Region,object
City,object
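
Once the column types are known, `select_dtypes()` can pull out only the columns of a given kind, for example the numeric ones before aggregating. This sketch uses a small stand-in frame with column names borrowed from the dataset; the values are illustrative.

```python
import pandas as pd

stand_in = pd.DataFrame(
    {
        "Row ID": [1, 2],
        "Order Date": pd.to_datetime(["2020-01-03", "2020-01-04"]),
        "Segment": ["Consumer", "Home Office"],
        "Sales": [16.448, 3.54],
    }
)

# Keep only columns whose dtype matches the requested family
numeric_cols = stand_in.select_dtypes(include="number").columns.tolist()
datetime_cols = stand_in.select_dtypes(include="datetime").columns.tolist()

print(numeric_cols)   # ['Row ID', 'Sales']
print(datetime_cols)  # ['Order Date']
```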


### Accessing the data

In [36]:
# get unique values in the 'Region' column
df_super['Region'].unique()

array(['Central', 'East', 'South', 'West'], dtype=object)

In [37]:
# use .isnull().sum() to count missing values in each column
# returns the total number of NaNs per column, useful for spotting incomplete data
df_super.isnull().sum()

Unnamed: 0,0
Row ID,0
Order ID,0
Order Date,0
Ship Date,0
Ship Mode,0
Customer ID,0
Customer Name,0
Segment,0
Country/Region,0
City,0
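
The columns shown above report zero missing values. To see what a non-zero result looks like, the following sketch applies the same call to a hypothetical frame that does contain gaps, and also counts the rows with at least one missing value.

```python
import numpy as np
import pandas as pd

stand_in = pd.DataFrame(
    {
        "City": ["Houston", None, "Naperville"],
        "Profit": [5.5512, np.nan, np.nan],
    }
)

# isnull() marks each missing cell as True; sum() counts the Trues per column
missing_per_column = stand_in.isnull().sum()

# any(axis=1) flags rows that have at least one missing cell
rows_with_missing = int(stand_in.isnull().any(axis=1).sum())

print(missing_per_column["City"])    # 1
print(missing_per_column["Profit"])  # 2
print(rows_with_missing)             # 2
```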


In [38]:
# return unique values in 'Segment' column
df_super['Segment'].unique()

array(['Consumer', 'Home Office', 'Corporate'], dtype=object)

In [39]:
# count how many instances of each unique value in the 'Segment' column
df_super['Segment'].value_counts()

Segment,count
Consumer,5281
Corporate,3090
Home Office,1823
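
Raw counts can be turned into shares by passing `normalize=True` to `value_counts()`. The sketch below uses a four-row stand-in Series rather than the real dataset, so the proportions are easy to verify by hand.

```python
import pandas as pd

segments = pd.Series(
    ["Consumer", "Consumer", "Corporate", "Home Office"], name="Segment"
)

# normalize=True returns fractions that add up to 1 instead of counts
shares = segments.value_counts(normalize=True)

print(shares["Consumer"])     # 0.5
print(shares["Corporate"])    # 0.25
print(shares["Home Office"])  # 0.25
```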
