# Exploratory Data Analysis (EDA)

## Introduction

In this notebook, we address the stakeholder questions using SQL queries on the cleaned streaming viewership data. The SQLite database is dynamically created from the cleaned CSV files.

---
## Step 1: Set Up the Environment

In [4]:

# Import necessary libraries
import pandas as pd
import sqlite3
import os

# Define the base project directory (update this to your project path)
base_project_path = '/Users/joshuastewart/Documents/Streaming Viewership EDA Project'

# Define subdirectory paths
data_path = os.path.join(base_project_path, 'Data')
clean_data_path = os.path.join(data_path, 'cleaned_data')
database_file = os.path.join(data_path, 'streaming_viewership.db')

# Define file paths
users_file = os.path.join(clean_data_path, 'users_table.csv')
sessions_file = os.path.join(clean_data_path, 'sessions_table.csv')
videos_file = os.path.join(clean_data_path, 'videos_table.csv')
devices_file = os.path.join(clean_data_path, 'devices_table.csv')
locations_file = os.path.join(clean_data_path, 'locations_table.csv')
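
Because every later step depends on these five CSV files, it can help to confirm they exist before loading anything. The following is a minimal sketch; the helper name `find_missing_files` and the placeholder path are assumptions for illustration and are not part of the original notebook.

```python
import os

def find_missing_files(paths):
    """Return the subset of paths that do not exist as regular files."""
    return [p for p in paths if not os.path.isfile(p)]

# Placeholder path for illustration; in the notebook this list would hold
# users_file, sessions_file, videos_file, devices_file, and locations_file.
example_paths = [os.path.join("Data", "cleaned_data", "users_table.csv")]
missing = find_missing_files(example_paths)
if missing:
    print("Missing input files:", missing)
```

Failing early with a list of missing files is usually easier to diagnose than a `FileNotFoundError` raised partway through loading.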
        

---
## Step 2: Load Cleaned Data and Create SQLite Database

In [7]:

# Load cleaned data into Pandas DataFrames
users = pd.read_csv(users_file)
sessions = pd.read_csv(sessions_file)
videos = pd.read_csv(videos_file)
devices = pd.read_csv(devices_file)
locations = pd.read_csv(locations_file)

# Create a SQLite database on disk
conn = sqlite3.connect(database_file)  # Saves database to 'streaming_viewership.db'

# Write DataFrames to SQLite
users.to_sql('Users', conn, index=False, if_exists='replace')
sessions.to_sql('Sessions', conn, index=False, if_exists='replace')
videos.to_sql('Videos', conn, index=False, if_exists='replace')
devices.to_sql('Devices', conn, index=False, if_exists='replace')
locations.to_sql('Locations', conn, index=False, if_exists='replace')

# Display confirmation of tables in the database
tables = pd.read_sql("SELECT name FROM sqlite_master WHERE type='table';", conn)
print("Tables in the SQLite database:")
print(tables)
        

Tables in the SQLite database:
        name
0      Users
1   Sessions
2     Videos
3    Devices
4  Locations
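
Listing table names confirms the tables exist but not that every row was written. One possible sanity check, sketched here against a throwaway in-memory database with made-up rows, is to count the rows in each table. The helper `table_row_counts` and the demo data are assumptions for illustration only.

```python
import sqlite3

def table_row_counts(conn):
    """Map each table name in the database to its row count."""
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
    # Names come straight from sqlite_master, so double-quoting them is enough here.
    return {n: conn.execute(f'SELECT COUNT(*) FROM "{n}"').fetchone()[0]
            for n in names}

# Toy database standing in for streaming_viewership.db
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Users (User_ID INTEGER)")
demo.executemany("INSERT INTO Users VALUES (?)", [(1,), (2,), (3,)])
demo.execute("CREATE TABLE Devices (Device_ID INTEGER)")
demo.executemany("INSERT INTO Devices VALUES (?)", [(10,)])

counts = table_row_counts(demo)
print(counts)
demo.close()
```

In the notebook, the same helper could be called with `conn` and each count compared with `len(users)`, `len(sessions)`, and so on.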


---
## Step 3: Exploratory Data Analysis (EDA)

### Content Preferences

**What are the top-performing genres among different user demographics (age groups, regions)?**
- To identify the top genres by demographic, we focused on the three countries with the most users (Congo, Korea, and Wallis and Futuna), segmented users into age groups, and calculated the total viewing duration for each genre. The table shows the most-watched genre for each age group in these countries, along with its total viewing duration in minutes and the number of distinct users who watched it. Only the first rows of the output appear below.

In [43]:
query = '''
WITH TopCountries AS (
    SELECT 
        Country,
        COUNT(User_ID) AS Total_Users
    FROM Users
    GROUP BY Country
    ORDER BY Total_Users DESC
    LIMIT 3
),
RankedGenres AS (
    SELECT 
        CASE
            WHEN u.Age BETWEEN 0 AND 17 THEN '0-17'
            WHEN u.Age BETWEEN 18 AND 24 THEN '18-24'
            WHEN u.Age BETWEEN 25 AND 34 THEN '25-34'
            WHEN u.Age BETWEEN 35 AND 44 THEN '35-44'
            WHEN u.Age BETWEEN 45 AND 54 THEN '45-54'
            WHEN u.Age BETWEEN 55 AND 64 THEN '55-64'
            ELSE '65+'
        END AS Age_Group,
        u.Country,
        v.Genre,
        SUM(s."Duration_Watched (minutes)") AS Total_Viewing_Duration,
        COUNT(DISTINCT s.User_ID) AS Total_Users,
        ROW_NUMBER() OVER (
            PARTITION BY u.Country, 
                CASE
                    WHEN u.Age BETWEEN 0 AND 17 THEN '0-17'
                    WHEN u.Age BETWEEN 18 AND 24 THEN '18-24'
                    WHEN u.Age BETWEEN 25 AND 34 THEN '25-34'
                    WHEN u.Age BETWEEN 35 AND 44 THEN '35-44'
                    WHEN u.Age BETWEEN 45 AND 54 THEN '45-54'
                    WHEN u.Age BETWEEN 55 AND 64 THEN '55-64'
                    ELSE '65+'
                END
            ORDER BY SUM(s."Duration_Watched (minutes)") DESC
        ) AS Rank
    FROM Sessions s
    JOIN Users u ON s.User_ID = u.User_ID
    JOIN Videos v ON s.Video_ID = v.Video_ID
    WHERE u.Country IN (SELECT Country FROM TopCountries)
    GROUP BY 
        CASE
            WHEN u.Age BETWEEN 0 AND 17 THEN '0-17'
            WHEN u.Age BETWEEN 18 AND 24 THEN '18-24'
            WHEN u.Age BETWEEN 25 AND 34 THEN '25-34'
            WHEN u.Age BETWEEN 35 AND 44 THEN '35-44'
            WHEN u.Age BETWEEN 45 AND 54 THEN '45-54'
            WHEN u.Age BETWEEN 55 AND 64 THEN '55-64'
            ELSE '65+'
        END, 
        u.Country, 
        v.Genre
)
SELECT 
    Age_Group,
    Country,
    Genre,
    Total_Viewing_Duration,
    Total_Users
FROM RankedGenres
WHERE Rank = 1
ORDER BY Country, Age_Group;
'''
top_genre_with_users = pd.read_sql(query, conn)
top_genre_with_users

|   | Age_Group | Country | Genre | Total_Viewing_Duration | Total_Users |
|---|-----------|---------|-------|------------------------|-------------|
| 0 | 0-17 | Congo | Comedy | 2099.523509 | 6 |
| 1 | 18-24 | Congo | Documentary | 2596.326679 | 7 |
| 2 | 25-34 | Congo | Documentary | 3147.72556 | 11 |
| 3 | 35-44 | Congo | Documentary | 3363.366011 | 12 |
| 4 | 45-54 | Congo | Drama | 2384.429519 | 8 |
| 5 | 55-64 | Congo | Documentary | 648.59544 | 2 |
| 6 | 65+ | Congo | Documentary | 1082.229731 | 3 |
| 7 | 0-17 | Korea | Drama | 1486.747253 | 5 |
| 8 | 18-24 | Korea | Thriller | 2174.781322 | 7 |
| 9 | 25-34 | Korea | Comedy | 1131.430981 | 4 |
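
The query above repeats the same age-bucketing `CASE` expression three times. One way to avoid that, sketched here on a small in-memory database, is to compute the bucket once in a CTE and rank in a later step. The toy rows, the CTE names (`Bucketed`, `Totals`, `Ranked`), and the `Genre_Rank` alias are assumptions for illustration; window functions require SQLite 3.25 or newer.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE Users (User_ID INTEGER, Age INTEGER, Country TEXT);
CREATE TABLE Videos (Video_ID INTEGER, Genre TEXT);
CREATE TABLE Sessions (User_ID INTEGER, Video_ID INTEGER,
                       "Duration_Watched (minutes)" REAL);
INSERT INTO Users VALUES (1, 20, 'A'), (2, 22, 'A'), (3, 40, 'A');
INSERT INTO Videos VALUES (10, 'Drama'), (11, 'Comedy');
INSERT INTO Sessions VALUES (1, 10, 30.0), (2, 11, 50.0),
                            (2, 10, 10.0), (3, 10, 15.0);
""")

query = """
WITH Bucketed AS (          -- age bucket computed once per session row
    SELECT
        CASE
            WHEN u.Age BETWEEN 0 AND 17 THEN '0-17'
            WHEN u.Age BETWEEN 18 AND 24 THEN '18-24'
            WHEN u.Age BETWEEN 25 AND 34 THEN '25-34'
            WHEN u.Age BETWEEN 35 AND 44 THEN '35-44'
            WHEN u.Age BETWEEN 45 AND 54 THEN '45-54'
            WHEN u.Age BETWEEN 55 AND 64 THEN '55-64'
            ELSE '65+'
        END AS Age_Group,
        u.Country,
        v.Genre,
        s."Duration_Watched (minutes)" AS Duration
    FROM Sessions s
    JOIN Users u ON s.User_ID = u.User_ID
    JOIN Videos v ON s.Video_ID = v.Video_ID
),
Totals AS (                 -- total minutes per bucket, country, and genre
    SELECT Age_Group, Country, Genre, SUM(Duration) AS Total_Viewing_Duration
    FROM Bucketed
    GROUP BY Age_Group, Country, Genre
),
Ranked AS (                 -- rank genres within each country and bucket
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY Country, Age_Group
               ORDER BY Total_Viewing_Duration DESC
           ) AS Genre_Rank
    FROM Totals
)
SELECT Age_Group, Country, Genre, Total_Viewing_Duration
FROM Ranked
WHERE Genre_Rank = 1
ORDER BY Country, Age_Group;
"""
top_rows = demo.execute(query).fetchall()
print(top_rows)
demo.close()
```

Note that, as in the original query, the `ELSE '65+'` branch also catches rows where `Age` is `NULL`, so missing ages would be counted in the oldest bucket.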


**Which genres drive the longest viewing durations, and how does this vary by subscription level?**
- To determine which genres drive the longest viewing durations and how this varies by subscription level, we calculated the average viewing duration for each genre across Free and Premium users. The table ranks each genre and subscription status pair by average viewing duration in minutes. Premium sessions fill the top six positions, but the differences are small: every average in the rows shown falls between about 60.25 and 60.64 minutes.

In [47]:

query = '''
SELECT 
    v.Genre,
    u.Subscription_Status,
    AVG(s."Duration_Watched (minutes)") AS Avg_Duration
FROM Sessions s
JOIN Users u ON s.User_ID = u.User_ID
JOIN Videos v ON s.Video_ID = v.Video_ID
GROUP BY v.Genre, u.Subscription_Status
ORDER BY Avg_Duration DESC;
'''
genres_by_subscription = pd.read_sql(query, conn)
genres_by_subscription
        

|   | Genre | Subscription_Status | Avg_Duration |
|---|-------|---------------------|--------------|
| 0 | Thriller | Premium | 60.643858 |
| 1 | Drama | Premium | 60.633605 |
| 2 | Comedy | Premium | 60.541565 |
| 3 | Sci-Fi | Premium | 60.502312 |
| 4 | Documentary | Premium | 60.465722 |
| 5 | Action | Premium | 60.39423 |
| 6 | Action | Free | 60.358055 |
| 7 | Thriller | Free | 60.279314 |
| 8 | Drama | Free | 60.25824 |
| 9 | Documentary | Free | 60.253722 |
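
To compare the two tiers genre by genre, it may be easier to place the Premium and Free averages side by side. The sketch below does this with conditional aggregation on a toy in-memory database; the rows and the output column names (`Premium_Avg`, `Free_Avg`, `Gap`) are assumptions for illustration.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE Users (User_ID INTEGER, Subscription_Status TEXT);
CREATE TABLE Videos (Video_ID INTEGER, Genre TEXT);
CREATE TABLE Sessions (User_ID INTEGER, Video_ID INTEGER,
                       "Duration_Watched (minutes)" REAL);
INSERT INTO Users VALUES (1, 'Premium'), (2, 'Free');
INSERT INTO Videos VALUES (10, 'Drama');
INSERT INTO Sessions VALUES (1, 10, 60.0), (1, 10, 70.0), (2, 10, 50.0);
""")

query = """
SELECT Genre, Premium_Avg, Free_Avg, Premium_Avg - Free_Avg AS Gap
FROM (
    SELECT
        v.Genre,
        -- AVG ignores NULLs, so each CASE averages only one tier's sessions
        AVG(CASE WHEN u.Subscription_Status = 'Premium'
                 THEN s."Duration_Watched (minutes)" END) AS Premium_Avg,
        AVG(CASE WHEN u.Subscription_Status = 'Free'
                 THEN s."Duration_Watched (minutes)" END) AS Free_Avg
    FROM Sessions s
    JOIN Users u ON s.User_ID = u.User_ID
    JOIN Videos v ON s.Video_ID = v.Video_ID
    GROUP BY v.Genre
)
ORDER BY Gap DESC;
"""
genre_gaps = demo.execute(query).fetchall()
print(genre_gaps)
demo.close()
```

Sorting by `Gap` puts the genres where Premium sessions run longest relative to Free sessions at the top.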


### User Engagement

**How does playback quality (e.g., HD, 4K) impact session duration and user interactions?**
- To analyze how playback quality impacts session duration and user interactions, we calculated the average session duration and interaction events for each playback quality (SD, HD, 4K). Interaction events include user actions like clicks, likes, and shares during a session.

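- The table shows that SD sessions have the highest average duration (about 61.9 minutes, roughly 2 minutes more than HD or 4K), while HD sessions have the most interaction events on average (about 52.2). These are associations in the data rather than evidence that playback quality causes the difference.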

In [53]:

query = '''
SELECT 
    s.Playback_Quality,
    AVG(s."Duration_Watched (minutes)") AS Avg_Duration,
    AVG(s.Interaction_Events) AS Avg_Interactions
FROM Sessions s
GROUP BY s.Playback_Quality
ORDER BY Avg_Duration DESC, Avg_Interactions DESC;
'''
playback_quality_impact = pd.read_sql(query, conn)
playback_quality_impact
        

|   | Playback_Quality | Avg_Duration | Avg_Interactions |
|---|------------------|--------------|------------------|
| 0 | SD | 61.869022 | 49.988867 |
| 1 | HD | 59.685971 | 52.152237 |
| 2 | 4K | 59.586784 | 50.203155 |
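
Whether a gap of about two minutes is meaningful depends on how many sessions each group contains and how widely durations vary. SQLite has no built-in standard deviation function, so one option is to pull the raw values and summarize them in Python. The sketch below uses made-up rows and the standard-library `statistics` module; it is an illustration, not part of the original analysis.

```python
import sqlite3
import statistics
from collections import defaultdict

demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE Sessions (Playback_Quality TEXT,
                       "Duration_Watched (minutes)" REAL);
INSERT INTO Sessions VALUES ('SD', 60.0), ('SD', 64.0),
                            ('HD', 58.0), ('HD', 62.0), ('HD', 60.0);
""")

# Collect the raw durations for each playback quality
durations = defaultdict(list)
for quality, minutes in demo.execute(
        'SELECT Playback_Quality, "Duration_Watched (minutes)" FROM Sessions'):
    durations[quality].append(minutes)
demo.close()

# Session count, mean, and sample standard deviation per group
summary = {
    quality: {
        "n": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
    }
    for quality, values in durations.items()
}
print(summary)
```

A difference between means that is small relative to the standard deviation, or that rests on few sessions, deserves more caution before it drives a recommendation.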


**Which devices are most commonly associated with higher engagement, and how do these trends differ between Premium and non-Premium users?**
- To identify devices associated with higher engagement, we calculated the average session duration and interaction events for each device type, segmented by subscription status (Premium and Free). In the rows shown, Premium desktop sessions have the longest average duration, followed closely by Free laptop and tablet sessions, while Free desktop sessions are the shortest. The spread is narrow: all ten averages fall within about one minute of each other.

In [58]:

query = '''
SELECT 
    u.Subscription_Status,
    d.Device_Type,
    AVG(s."Duration_Watched (minutes)") AS Avg_Duration,
    AVG(s.Interaction_Events) AS Avg_Interactions
FROM Sessions s
JOIN Users u ON s.User_ID = u.User_ID
JOIN Devices d ON s.Device_ID = d.Device_ID
GROUP BY u.Subscription_Status, d.Device_Type
ORDER BY Avg_Duration DESC, Avg_Interactions DESC;
'''
device_engagement = pd.read_sql(query, conn)
device_engagement
        

|   | Subscription_Status | Device_Type | Avg_Duration | Avg_Interactions |
|---|---------------------|-------------|--------------|------------------|
| 0 | Premium | Desktop | 60.755172 | 50.491667 |
| 1 | Free | Laptop | 60.670336 | 50.329787 |
| 2 | Free | Tablet | 60.663525 | 49.966372 |
| 3 | Premium | Laptop | 60.575642 | 51.180687 |
| 4 | Premium | Tablet | 60.509821 | 51.385928 |
| 5 | Free | Smartphone | 60.367631 | 51.121423 |
| 6 | Free | Smart TV | 60.310355 | 50.885749 |
| 7 | Premium | Smart TV | 60.310026 | 50.895335 |
| 8 | Premium | Smartphone | 60.201166 | 50.214406 |
| 9 | Free | Desktop | 59.886821 | 50.390464 |
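
Because the result mixes both tiers in one sorted list, the per-tier ordering has to be read off by eye. A possible alternative, sketched below on toy data, ranks device types within each subscription tier using `RANK()`. The rows, the `DeviceAvg` CTE, and the `Tier_Rank` alias are assumptions for illustration.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE Users (User_ID INTEGER, Subscription_Status TEXT);
CREATE TABLE Devices (Device_ID INTEGER, Device_Type TEXT);
CREATE TABLE Sessions (User_ID INTEGER, Device_ID INTEGER,
                       "Duration_Watched (minutes)" REAL);
INSERT INTO Users VALUES (1, 'Premium'), (2, 'Free');
INSERT INTO Devices VALUES (1, 'Desktop'), (2, 'Laptop');
INSERT INTO Sessions VALUES (1, 1, 70.0), (1, 2, 60.0),
                            (2, 1, 50.0), (2, 2, 65.0);
""")

query = """
WITH DeviceAvg AS (
    SELECT u.Subscription_Status,
           d.Device_Type,
           AVG(s."Duration_Watched (minutes)") AS Avg_Duration
    FROM Sessions s
    JOIN Users u ON s.User_ID = u.User_ID
    JOIN Devices d ON s.Device_ID = d.Device_ID
    GROUP BY u.Subscription_Status, d.Device_Type
)
SELECT Subscription_Status,
       Device_Type,
       Avg_Duration,
       -- rank restarts at 1 for each subscription tier
       RANK() OVER (
           PARTITION BY Subscription_Status
           ORDER BY Avg_Duration DESC
       ) AS Tier_Rank
FROM DeviceAvg
ORDER BY Subscription_Status, Tier_Rank;
"""
ranked_devices = demo.execute(query).fetchall()
print(ranked_devices)
demo.close()
```

The same pattern could rank by `Avg_Interactions` instead, which matters here because duration and interactions do not always favor the same device.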


### Customer Retention

**What factors (e.g., genre, device type, subscription level) correlate with higher session ratings?**
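- To identify factors correlating with higher session ratings, we grouped ages into broader ranges and analyzed average ratings and session durations by age group, genre, device type, and subscription status. The ten highest-rated combinations split into two clusters: Premium users aged 25-34 watching Action and Sci-Fi, and Free users aged 65+ watching Sci-Fi and Comedy, mostly on desktops and laptops. Average ratings in these groups range from about 3.13 to 3.17, and the Premium 25-34 groups also have longer average durations (about 61 to 62 minutes) than the Free 65+ groups (about 58 to 60 minutes).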

In [75]:
query = '''
SELECT 
    CASE
        WHEN u.Age BETWEEN 0 AND 17 THEN '0-17'
        WHEN u.Age BETWEEN 18 AND 24 THEN '18-24'
        WHEN u.Age BETWEEN 25 AND 34 THEN '25-34'
        WHEN u.Age BETWEEN 35 AND 44 THEN '35-44'
        WHEN u.Age BETWEEN 45 AND 54 THEN '45-54'
        WHEN u.Age BETWEEN 55 AND 64 THEN '55-64'
        ELSE '65+'
    END AS Age_Group,
    v.Genre,
    d.Device_Type,
    u.Subscription_Status,
    AVG(s.Ratings) AS Avg_Rating,
    AVG(s."Duration_Watched (minutes)") AS Avg_Duration
FROM Sessions s
JOIN Users u ON s.User_ID = u.User_ID
JOIN Videos v ON s.Video_ID = v.Video_ID
JOIN Devices d ON s.Device_ID = d.Device_ID
GROUP BY Age_Group, v.Genre, d.Device_Type, u.Subscription_Status
ORDER BY Avg_Rating DESC, Avg_Duration DESC
LIMIT 10;
'''
factors_ratings = pd.read_sql(query, conn)
factors_ratings

|   | Age_Group | Genre | Device_Type | Subscription_Status | Avg_Rating | Avg_Duration |
|---|-----------|-------|-------------|---------------------|------------|--------------|
| 0 | 25-34 | Action | Desktop | Premium | 3.170462 | 61.808729 |
| 1 | 25-34 | Action | Laptop | Premium | 3.154801 | 61.954266 |
| 2 | 25-34 | Action | Smartphone | Premium | 3.145767 | 60.750865 |
| 3 | 65+ | Sci-Fi | Desktop | Free | 3.14486 | 58.488551 |
| 4 | 65+ | Comedy | Laptop | Free | 3.140578 | 58.323593 |
| 5 | 65+ | Comedy | Desktop | Free | 3.139299 | 58.561932 |
| 6 | 25-34 | Sci-Fi | Desktop | Premium | 3.136808 | 61.604663 |
| 7 | 65+ | Comedy | Smartphone | Free | 3.136792 | 59.951301 |
| 8 | 65+ | Sci-Fi | Laptop | Free | 3.135354 | 57.974041 |
| 9 | 25-34 | Sci-Fi | Laptop | Premium | 3.134729 | 61.827035 |
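
Grouping by four dimensions at once produces many small groups, and a top-10 list of averages can be dominated by groups with only a handful of sessions. One common safeguard is a `HAVING COUNT(*)` filter. The sketch below shows the idea on a simplified single-table example; the rows, the `Session_Count` column, and the threshold of 3 are arbitrary assumptions for illustration.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE Sessions (Genre TEXT, Ratings REAL);
INSERT INTO Sessions VALUES ('Action', 5.0),
                            ('Drama', 3.0), ('Drama', 4.0), ('Drama', 3.5);
""")

query = """
SELECT Genre, AVG(Ratings) AS Avg_Rating, COUNT(*) AS Session_Count
FROM Sessions
GROUP BY Genre
HAVING COUNT(*) >= ?        -- drop groups with too few sessions
ORDER BY Avg_Rating DESC;
"""

# With no minimum, a single 5-star session puts Action on top
unfiltered = demo.execute(query, (1,)).fetchall()
# With a minimum of 3 sessions, only Drama remains
filtered = demo.execute(query, (3,)).fetchall()
print(unfiltered)
print(filtered)
demo.close()
```

Reporting the session count next to each average also lets readers judge how much weight a given row can bear.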


### Platform Optimization

**Are there device types or playback qualities we should optimize further to enhance the viewing experience?**
- To identify device types or playback qualities to optimize, we calculated the average session duration and interaction events for each combination of device type and playback quality. SD playback has the longest average session duration on every device type, led by laptops (about 62.3 minutes) and tablets (about 61.7 minutes), so these combinations are the first candidates when optimizing for session duration. HD playback is associated with the highest average interaction events, including on smartphones and Smart TVs; among the rows shown, HD on laptops and Smart TVs tops the interaction column at about 52.3 events. These HD combinations could benefit from enhancements to further improve engagement.

In [89]:

query = '''
SELECT 
    d.Device_Type,
    s.Playback_Quality,
    AVG(s."Duration_Watched (minutes)") AS Avg_Duration,
    AVG(s.Interaction_Events) AS Avg_Interactions
FROM Sessions s
JOIN Devices d ON s.Device_ID = d.Device_ID
GROUP BY d.Device_Type, s.Playback_Quality
ORDER BY Avg_Duration DESC, Avg_Interactions DESC;
'''
device_quality_impact = pd.read_sql(query, conn)
device_quality_impact
        

|   | Device_Type | Playback_Quality | Avg_Duration | Avg_Interactions |
|---|-------------|------------------|--------------|------------------|
| 0 | Laptop | SD | 62.314347 | 49.570142 |
| 1 | Tablet | SD | 61.668462 | 50.133163 |
| 2 | Smartphone | SD | 61.553223 | 49.543034 |
| 3 | Smart TV | SD | 61.392596 | 49.710315 |
| 4 | Desktop | SD | 61.285072 | 49.881988 |
| 5 | Tablet | HD | 60.270119 | 51.855677 |
| 6 | Smart TV | HD | 60.09894 | 52.269278 |
| 7 | Desktop | 4K | 60.090835 | 49.745522 |
| 8 | Smartphone | 4K | 59.958455 | 50.172564 |
| 9 | Laptop | HD | 59.869343 | 52.307097 |
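
The query sorts by duration first, so the combinations with the most interactions are scattered through the result. A small helper that rebuilds the query for a chosen metric makes both views easy to produce. This is a hedged sketch: `build_device_quality_query`, `ALLOWED_METRICS`, and the toy rows are assumptions for illustration, and the whitelist exists because column names cannot be passed as SQL parameters.

```python
import sqlite3

ALLOWED_METRICS = {"Avg_Duration", "Avg_Interactions"}

def build_device_quality_query(order_by):
    """Return the device/quality query sorted by a whitelisted metric."""
    if order_by not in ALLOWED_METRICS:
        raise ValueError(f"Unsupported metric: {order_by}")
    return f"""
    SELECT d.Device_Type,
           s.Playback_Quality,
           AVG(s."Duration_Watched (minutes)") AS Avg_Duration,
           AVG(s.Interaction_Events) AS Avg_Interactions
    FROM Sessions s
    JOIN Devices d ON s.Device_ID = d.Device_ID
    GROUP BY d.Device_Type, s.Playback_Quality
    ORDER BY {order_by} DESC;
    """

demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE Devices (Device_ID INTEGER, Device_Type TEXT);
CREATE TABLE Sessions (Device_ID INTEGER, Playback_Quality TEXT,
                       "Duration_Watched (minutes)" REAL,
                       Interaction_Events INTEGER);
INSERT INTO Devices VALUES (1, 'Laptop'), (2, 'Smart TV');
INSERT INTO Sessions VALUES (1, 'SD', 62.0, 49), (2, 'HD', 60.0, 52);
""")

top_by_duration = demo.execute(
    build_device_quality_query("Avg_Duration")).fetchone()
top_by_interactions = demo.execute(
    build_device_quality_query("Avg_Interactions")).fetchone()
print(top_by_duration)
print(top_by_interactions)
demo.close()
```

In the notebook, the returned string could be passed to `pd.read_sql(..., conn)` in place of the hard-coded query.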


**What steps can we take to improve engagement on underperforming subscription tiers?**
- To analyze engagement across subscription tiers, we calculated viewer counts, average session duration, interaction events, and ratings for Free and Premium users, along with minimum and maximum values for session duration and interactions. The table reveals that the two tiers have nearly identical engagement metrics: Premium users have slightly higher average duration (60.49 vs. 60.26 minutes) and interaction events (50.87 vs. 50.69), while Free users give slightly higher ratings (3.00 vs. 2.98). Because Premium shows no clear engagement advantage, strategies to improve Premium engagement could include exclusive content offerings, tailored interactive features, and enhanced playback quality to better justify the premium cost.

In [99]:
query = '''
SELECT 
    u.Subscription_Status,
    COUNT(DISTINCT s.User_ID) AS Viewer_Count,
    AVG(s."Duration_Watched (minutes)") AS Avg_Duration,
    MIN(s."Duration_Watched (minutes)") AS Min_Duration,
    MAX(s."Duration_Watched (minutes)") AS Max_Duration,
    AVG(s.Interaction_Events) AS Avg_Interactions,
    MIN(s.Interaction_Events) AS Min_Interactions,
    MAX(s.Interaction_Events) AS Max_Interactions,
    AVG(s.Ratings) AS Avg_Rating
FROM Sessions s
JOIN Users u ON s.User_ID = u.User_ID
GROUP BY u.Subscription_Status
ORDER BY Avg_Duration ASC;
'''
subscription_engagement = pd.read_sql(query, conn)
subscription_engagement

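|   | Subscription_Status | Viewer_Count | Avg_Duration | Min_Duration | Max_Duration | Avg_Interactions | Min_Interactions | Max_Interactions | Avg_Rating |
|---|---------------------|--------------|--------------|--------------|--------------|------------------|------------------|------------------|------------|
| 0 | Free | 3111 | 60.26343 | 0.055809 | 119.990886 | 50.688525 | 0 | 100 | 2.999679 |
| 1 | Premium | 3103 | 60.493643 | 0.072536 | 119.999972 | 50.865292 | 0 | 100 | 2.975185 |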
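
To put the tier differences in proportion, the averages reported in the table can be converted to relative differences. The short calculation below copies the values from the output above; the dictionary and variable names are illustrative.

```python
# Averages copied from the subscription_engagement output above
free = {"Avg_Duration": 60.26343, "Avg_Interactions": 50.688525,
        "Avg_Rating": 2.999679}
premium = {"Avg_Duration": 60.493643, "Avg_Interactions": 50.865292,
           "Avg_Rating": 2.975185}

# Percentage difference of Premium relative to Free for each metric
pct_diff = {metric: 100 * (premium[metric] - free[metric]) / free[metric]
            for metric in free}

for metric, pct in pct_diff.items():
    print(f"{metric}: Premium vs Free = {pct:+.2f}%")
```

Premium is ahead on duration and interactions and behind on ratings, and each gap is less than 1% of the Free value.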
