In [1]:
from lec_utils import *

<div class="alert alert-info" markdown="1">

#### Discussion 3

# DataFrames and Querying


### EECS 398: Practical Data Science, Winter 2025

<small><a style="text-decoration: none" href="https://practicaldsc.org">practicaldsc.org</a> • <a style="text-decoration: none" href="https://github.com/practicaldsc/wn25">github.com/practicaldsc/wn25</a> • 📣 See latest announcements [**here on Ed**](https://edstem.org/us/courses/69737/discussion/5943734) </small>
    
</div>

### Agenda 📆

- `value_counts`.
- `loc` vs. `iloc`.
- Querying 🔍.
- Worksheet 📝.

### Today's Dataset 📲

- Today, we're going to be working with a dataset on Social Media and Entertainment Usage in the USA (maybe rip TikTok?).

- A reminder that we can read in a CSV using `pd.read_csv(path)`, where `path` is the dataset's location relative to where your notebook is stored.

- Let's store our dataset in the `df` variable. 

In [None]:
df = pd.read_csv('data/social_media_entertainment.csv')
df

### Choosing an index

- The index of our DataFrame should be a **unique** identifier for each row.

- Here, it makes sense to choose `'User ID'` as our index, since it takes on a different value in each row.

In [None]:
print('Original shape: ', df.shape)
df = df.set_index('User ID')
print('New shape: ', df.shape)

- Observe that we now have only 39 columns instead of 40, since `'User ID'` has moved from the columns into the index!
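
As a rough illustration of the same idea, here is a minimal sketch on a tiny made-up DataFrame. The name `toy` and all of its values are hypothetical and are not taken from today's dataset. It checks whether a column is unique before using it as the index, and then confirms that `set_index` removes that column from the columns.

```python
import pandas as pd

# Hypothetical toy data, not from the real dataset.
toy = pd.DataFrame({
    'User ID': [1, 2, 3],
    'Age': [25, 31, 19],
    'Primary Platform': ['TikTok', 'Instagram', 'TikTok'],
})

# A good index candidate takes on a different value in every row.
print(toy['User ID'].is_unique)           # True
print(toy['Primary Platform'].is_unique)  # False

# set_index returns a new DataFrame; 'User ID' moves from the columns to the index.
toy_indexed = toy.set_index('User ID')
print(toy.shape, toy_indexed.shape)       # (3, 3) (3, 2)
```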

### `value_counts` 

- Each row/column of our DataFrame is a **Series**, a 1D, array-like object.

- `value_counts` is a **Series method** that returns another Series containing the number of times each unique value appears in a column (or any other Series), sorted from most to least frequent.

- In other words, it describes the **distribution** of the unique values within a certain column.

In [None]:
s = df['Preferred Entertainment Platform']
s

In [None]:
# What is the distribution of Preferred Entertainment Platforms?
df['Preferred Entertainment Platform'].value_counts(normalize=True)

In [None]:
# Which entertainment platform is preferred by the most users?
df['Preferred Entertainment Platform'].value_counts().idxmax()

In [None]:
# How many users prefer the leading entertainment platform? 
df['Preferred Entertainment Platform'].value_counts().max()

In [None]:
# How many prefer the least popular entertainment platform?
df['Preferred Entertainment Platform'].value_counts().min()
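
To see what these calls return without the full dataset, here is a small sketch on a made-up Series. The name `platforms` and its values are hypothetical. It shows the raw counts, the normalized proportions, and how `idxmax`/`idxmin` return the *label* while `max`/`min` return the *count*.

```python
import pandas as pd

# Hypothetical toy Series, not from the real dataset.
platforms = pd.Series(['Netflix', 'Spotify', 'Netflix', 'YouTube', 'Netflix', 'Spotify'])

# Counts of each unique value, sorted from most to least common.
counts = platforms.value_counts()

# The same distribution expressed as proportions that sum to 1.
proportions = platforms.value_counts(normalize=True)

print(counts.idxmax(), counts.max())   # Netflix 3
print(counts.idxmin(), counts.min())   # YouTube 1
print(proportions['Netflix'])          # 0.5
```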

### `loc` vs. `iloc` 

- `iloc` stands for "integer-location".

- You use it to select data by ***integer*** position (i.e., row and column numbers).

- `iloc` uses zero-based indexing, just like 2D arrays!

In [None]:
# Select the rows at positions 1-6 and the columns at positions 3-8 (the end of each slice is excluded)
df.iloc[1:7, 3:9]

- `loc` stands for "location" and allows for **label-based** indexing.

- The `loc` indexer works similarly to slicing 2D arrays, but it uses **row and column labels** instead of positions.<br><small>Remember, the "index" refers to the row labels.</small>

In [None]:
# Show the Sleep Quality for the user with ID 3
df.loc[3, 'Sleep Quality (scale 1-10)']
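
One difference that is easy to miss is how the two indexers treat slices. The sketch below uses a made-up DataFrame (`toy`, with hypothetical values and shortened column names) whose index labels deliberately start at 1, so that labels and positions disagree.

```python
import pandas as pd

# Hypothetical toy data; the index labels start at 1, not 0.
toy = pd.DataFrame(
    {'Age': [25, 31, 19, 42], 'Sleep Quality': [7, 5, 8, 6]},
    index=pd.Index([1, 2, 3, 4], name='User ID'),
)

# iloc: integer positions; the end of a slice is excluded.
by_position = toy.iloc[1:3, 0:1]    # rows at positions 1 and 2, column at position 0

# loc: labels; both ends of a slice are included.
by_label = toy.loc[1:3, 'Age':'Sleep Quality']    # rows labeled 1, 2, 3, both columns

print(by_position.shape)             # (2, 1)
print(by_label.shape)                # (3, 2)
print(toy.loc[3, 'Sleep Quality'])   # 8
```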

### Example: Daily Social Media Usage 📱

- Let's take a closer look at how `loc` and `iloc` work by examining the `'Daily Social Media Time (hrs)'` column.

In [None]:
s = df['Daily Social Media Time (hrs)']
s

- Note that `s` is a Series object!

In [None]:
type(s)

In [None]:
# What is the daily social media time (in hours) of user with ID 3?
s.loc[3]

In [None]:
# What is the daily social media time (in hours) of the user in row 3 (zero-indexed)?
s.iloc[3]

- We can verify this by looking at the first four rows of `s`!

In [None]:
s.head(4)

- Notice that `loc` allows us to directly extract information about a user if we know their user ID, which is our DataFrame's index.

- If we don't know their user ID, we can instead sort the values in our Series and use `iloc`!

- In general, because `iloc` relies on hard-coded integer positions, it is most useful when we know *exactly* where certain values lie in our data.

- So, it's most often used right after sorting, when we know the largest or smallest value sits at position 0!

In [None]:
# What is the highest social media usage time?
s.sort_values(ascending=False).iloc[0]

- We use `iloc` instead of `loc` here because we don't know the User ID of the user with the highest social media time.
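
Here is a minimal sketch of why sorting matters, using a made-up Series (`hours`, with hypothetical values). Before sorting, position 0 is just whichever row happens to come first; after sorting, positions 0 and -1 are guaranteed to hold the largest and smallest values.

```python
import pandas as pd

# Hypothetical toy Series indexed by user ID.
hours = pd.Series([2.5, 6.0, 4.25, 1.0],
                  index=pd.Index([1, 2, 3, 4], name='User ID'))

sorted_hours = hours.sort_values(ascending=False)

print(hours.iloc[0])          # 2.5, just the first row, not the maximum
print(sorted_hours.iloc[0])   # 6.0, the largest value
print(sorted_hours.iloc[-1])  # 1.0, the smallest value
print(sorted_hours.index[0])  # 2, the user ID with the largest value
```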

### Using `loc` and `iloc` on a DataFrame

- Recall that `loc` takes the row label(s) first and the column label(s) second, and that either argument can be a single label or a sequence (list, array, Series):

In [None]:
# The first argument is the row label, i.e. the index value.
#        ↓
df.loc[3, 'Daily Social Media Time (hrs)']
#                  ↑
# The second argument is the column label.

In [None]:
df.loc[[3, 8, 24, 25], ['Primary Platform', 'Daily Social Media Time (hrs)']]

- With `iloc`, you can provide a slice of integer positions for both the rows and the columns.

In [None]:
df.iloc[:6, :8]

In [None]:
# Find the age of the user with the highest daily messaging time.
# Our DataFrame has a column called 'Daily Messaging Time (hrs)' and 'Age'.
df.sort_values('Daily Messaging Time (hrs)', ascending=False)['Age'].iloc[0]
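
As an aside, the same question can be answered without sorting. The sketch below is one possible alternative rather than the approach this discussion relies on: it uses `idxmax` to find the index label of the top row, then `loc` to look up that label. The DataFrame `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, not from the real dataset.
toy = pd.DataFrame(
    {'Age': [25, 31, 19], 'Daily Messaging Time (hrs)': [1.5, 0.5, 3.0]},
    index=pd.Index([1, 2, 3], name='User ID'),
)

# Approach from above: sort, select the column, take the first position.
age_via_sort = toy.sort_values('Daily Messaging Time (hrs)', ascending=False)['Age'].iloc[0]

# Alternative: find the index label of the maximum, then look it up with loc.
top_id = toy['Daily Messaging Time (hrs)'].idxmax()
age_via_label = toy.loc[top_id, 'Age']

print(top_id)                       # 3
print(age_via_sort, age_via_label)  # 19 19
```

When several users tie for the highest value, the two approaches may pick different users, so they only agree reliably when the maximum is unique.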

### Querying 🔍

- We use querying to extract the rows in our DataFrame that satisfy certain ***conditions***.

- Boolean expressions are particularly useful here: comparing a column to a value produces a Series of `True`/`False` values, one per row!

In [None]:
df['Primary Platform'] == 'TikTok'

- Combining these with `loc` allows us to select the rows we're looking for.

In [None]:
df.loc[df['Primary Platform'] == 'TikTok']

- You can also query with multiple conditions! Combine them with the operators `&` ("and") or `|` ("or"), and wrap each condition in parentheses.

In [None]:
df.loc[(df['Primary Platform'] == 'TikTok') & (df['Daily Social Media Time (hrs)'] > 3.0)]
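
To make the mechanics concrete, here is a small sketch on a made-up DataFrame. The names `toy`, `is_tiktok`, and `heavy_use` and all values are hypothetical. Each condition is stored as its own Boolean Series and then combined. When conditions are written inline, as in the cell above, the parentheses are needed because `&` and `|` bind more tightly than `==` and `>`.

```python
import pandas as pd

# Hypothetical toy data, not from the real dataset.
toy = pd.DataFrame(
    {
        'Primary Platform': ['TikTok', 'Instagram', 'TikTok', 'YouTube'],
        'Daily Social Media Time (hrs)': [4.0, 5.5, 2.0, 1.0],
    },
    index=pd.Index([1, 2, 3, 4], name='User ID'),
)

# Each comparison produces a Boolean Series with one True/False per row.
is_tiktok = toy['Primary Platform'] == 'TikTok'
heavy_use = toy['Daily Social Media Time (hrs)'] > 3.0

both = toy.loc[is_tiktok & heavy_use]     # rows where both conditions hold
either = toy.loc[is_tiktok | heavy_use]   # rows where at least one holds

print(list(both.index))     # [1]
print(list(either.index))   # [1, 2, 3]
print(is_tiktok.sum())      # 2, since True counts as 1
```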

In [None]:
# How many users use TikTok as their primary platform?
df.loc[df['Primary Platform'] == 'TikTok'].shape[0]

In [None]:
# Among users who use TikTok as their primary platform, what is the highest Daily Social Media Time?
df.loc[df['Primary Platform'] == 'TikTok', 'Daily Social Media Time (hrs)'].sort_values(ascending=False).iloc[0]

In [None]:
# Among users whose preferred entertainment platform is Spotify, which gender is the most common?
# Our DataFrame has a column called 'Preferred Entertainment Platform'.
df.loc[df['Preferred Entertainment Platform'] == 'Spotify', 'Gender'].value_counts().idxmax()