# ETL Process for Streaming Viewership Data

## Introduction

**Purpose:**

Clean and transform the raw streaming viewership dataset into structured tables suitable for analysis.

---

## Extract

### Step 1: Load the Raw Data

**Files:**

- **Streaming Viewership Data**: Load `streaming_viewership_data.csv` to extract and analyze session-level and user-level information.

**Description:**

The dataset contains raw information on users, sessions, videos, devices, and locations. We load it into a pandas DataFrame for initial inspection and validation.

In [46]:

import pandas as pd

# Load the raw dataset
file_path = '/Users/joshuastewart/Documents/Streaming Viewership EDA Project/Data/streaming_viewership_data.csv'
data = pd.read_csv(file_path)

# Display data overview
print("Data Overview:")
print(data.head())
print(f"Dataset contains {data.shape[0]} rows and {data.shape[1]} columns.")


Data Overview:
                                User_ID                            Session_ID  \
0  eb4f9229-74df-45f6-baac-cf19241b8b30  cb2142a7-0750-49ed-b8ac-a975fe1ff69a   
1  661d4b59-4328-410a-901c-1e3b4c40c334  3bc0a662-b353-4015-8b0c-55ceb510d13a   
2  dd3fe9e9-ea82-4891-ab93-8a47c80e3251  bd545b4a-9f54-4e87-b9f8-15ae20b44f22   
3  a1b3365b-1d00-4ddf-bc43-02fc9c10c680  0441086d-c59e-478d-a496-5c5b995ecfdb   
4  338d3f91-5f1c-4590-8803-324901826406  0295f01d-7f15-4799-856c-90c688697ef8   

   Device_ID  Video_ID  Duration_Watched (minutes)        Genre  \
0        232        11                   90.044525       Sci-Fi   
1        549        85                   68.973479       Comedy   
2        844        50                   42.511343       Comedy   
3        201        38                   53.316660  Documentary   
4        700        31                   69.437786       Action   

                            Country  Age  Gender Subscription_Status  Ratings  \
0             
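
The validation mentioned in the description can start with a schema check. The sketch below is one possible approach and is not part of the original notebook: it compares a DataFrame's columns with the sixteen column names that appear in the notebook's missing-value report. The helper `find_schema_problems` and the toy frame are hypothetical.

```python
import pandas as pd

# Column names taken from the notebook's own output; treating them as the
# required schema is an assumption made for this sketch.
EXPECTED_COLUMNS = [
    "User_ID", "Session_ID", "Device_ID", "Video_ID",
    "Duration_Watched (minutes)", "Genre", "Country", "Age", "Gender",
    "Subscription_Status", "Ratings", "Languages", "Device_Type",
    "Location", "Playback_Quality", "Interaction_Events",
]

def find_schema_problems(df, expected=EXPECTED_COLUMNS):
    """Return (missing, unexpected) column names relative to `expected`."""
    missing = [col for col in expected if col not in df.columns]
    unexpected = [col for col in df.columns if col not in expected]
    return missing, unexpected

# Toy frame standing in for the real file (hypothetical values).
sample = pd.DataFrame({"User_ID": ["u1"], "Session_ID": ["s1"], "Extra": [1]})
missing, unexpected = find_schema_problems(sample)
print(f"Missing columns: {missing}")
print(f"Unexpected columns: {unexpected}")
```

On the real data, calling `find_schema_problems(data)` right after `pd.read_csv` would flag a renamed or dropped column before any later step depends on it.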

---

## Transform

### Step 2: Check for Duplicates and Missing Data

**Files:**

- **Streaming Viewership Data**: Check the DataFrame loaded from `streaming_viewership_data.csv` for duplicate rows and missing values.

**Description:**

This step protects the dataset’s integrity by removing exact duplicate rows and counting null values in each column before the data is split into relational tables. The output below shows that the raw file has no duplicate rows and no missing values, so nothing is dropped or imputed.

In [22]:

# Check for duplicate rows
duplicates = data.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

# Remove duplicates if found
if duplicates > 0:
    data = data.drop_duplicates()
    print("Duplicates removed.")

# Check for missing values
missing = data.isnull().sum()
print("Missing values in each column:")
print(missing)


Number of duplicate rows: 0
Missing values in each column:
User_ID                       0
Session_ID                    0
Device_ID                     0
Video_ID                      0
Duration_Watched (minutes)    0
Genre                         0
Country                       0
Age                           0
Gender                        0
Subscription_Status           0
Ratings                       0
Languages                     0
Device_Type                   0
Location                      0
Playback_Quality              0
Interaction_Events            0
dtype: int64
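
Because the report above shows no missing values, the notebook never has to drop or fill anything. If nulls did appear, one possible policy (an assumption for illustration, not the author's method) is to drop rows that lack an identifier and fill numeric gaps with the column median. The helper `handle_missing` and the toy data are hypothetical.

```python
import pandas as pd

def handle_missing(df, key_columns, numeric_columns):
    """Drop rows lacking a key, then median-fill numeric gaps (one possible policy)."""
    # Rows without an identifier cannot be linked to other tables, so drop them.
    cleaned = df.dropna(subset=key_columns).copy()
    # Fill remaining numeric gaps with the median of the surviving rows.
    for col in numeric_columns:
        cleaned[col] = cleaned[col].fillna(cleaned[col].median())
    return cleaned

# Toy data with one missing key and one missing numeric value.
toy = pd.DataFrame({
    "Session_ID": ["s1", "s2", None, "s4"],
    "Age": [30.0, None, 25.0, 50.0],
})
cleaned = handle_missing(toy, key_columns=["Session_ID"], numeric_columns=["Age"])
print(cleaned)
```

Whether to drop, fill, or flag missing values depends on the analysis, so the policy should be chosen per column rather than applied wholesale.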


### Step 3: Transform Data into Relational Tables

**Files:**

- **Users Table**: Extract unique user-level information such as demographics and subscription status.
- **Sessions Table**: Extract session-level data, including playback details and interactions.
- **Videos Table**: Extract video-specific details such as genre and language.
- **Devices Table**: Extract information about device types.
- **Locations Table**: Extract unique mappings of locations and countries.

**Description:**

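Split the cleaned dataset into five structured tables (Users, Sessions, Videos, Devices, and Locations) to organize the data logically and support relational queries. Each table keeps only the columns that describe its entity and drops rows that are identical across those columns.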
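
As an illustration of the kind of relational query this layout is meant to support, the sketch below joins a toy Sessions table to a toy Users table on `User_ID` and averages watch time per subscription status. The rows and the status values `Premium` and `Free` are hypothetical; only the column names come from the dataset.

```python
import pandas as pd

# Toy tables using the notebook's column names (values are made up).
users = pd.DataFrame({
    "User_ID": ["u1", "u2"],
    "Subscription_Status": ["Premium", "Free"],
})
sessions = pd.DataFrame({
    "Session_ID": ["s1", "s2", "s3"],
    "User_ID": ["u1", "u1", "u2"],
    "Duration_Watched (minutes)": [90.0, 30.0, 45.0],
})

# validate="many_to_one" raises an error if User_ID is not unique in `users`.
joined = sessions.merge(users, on="User_ID", how="left", validate="many_to_one")
avg_duration = joined.groupby("Subscription_Status")["Duration_Watched (minutes)"].mean()
print(avg_duration)
```

The `validate` argument is a cheap guard: a join against a table whose key is not unique would otherwise silently multiply session rows.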

In [25]:

# Users Table
users = data[['User_ID', 'Country', 'Age', 'Gender', 'Subscription_Status']].drop_duplicates()
print(f"Users Table: {users.shape[0]} rows")

# Sessions Table
sessions = data[['Session_ID', 'User_ID', 'Device_ID', 'Video_ID', 
                 'Duration_Watched (minutes)', 'Playback_Quality', 
                 'Interaction_Events', 'Ratings']].drop_duplicates()
print(f"Sessions Table: {sessions.shape[0]} rows")

# Videos Table
videos = data[['Video_ID', 'Genre', 'Languages']].drop_duplicates()
print(f"Videos Table: {videos.shape[0]} rows")

# Devices Table
devices = data[['Device_ID', 'Device_Type']].drop_duplicates()
print(f"Devices Table: {devices.shape[0]} rows")

# Locations Table
locations = data[['Location', 'Country']].drop_duplicates()
print(f"Locations Table: {locations.shape[0]} rows")


Users Table: 6214 rows
Sessions Table: 6214 rows
Videos Table: 2618 rows
Devices Table: 3563 rows
Locations Table: 6207 rows
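
The row counts alone do not show whether each table's ID column identifies a single row. `drop_duplicates()` removes only rows that match on every selected column, so the same ID can still appear on several rows with different attributes. The sketch below is one way to check this, using a hypothetical helper `conflicting_keys` and toy data.

```python
import pandas as pd

def conflicting_keys(table, key):
    """Return rows whose `key` value appears on more than one row of `table`."""
    return table[table.duplicated(subset=key, keep=False)].sort_values(key)

# Toy Videos table: Video_ID 11 carries two different genres (made-up values).
toy_videos = pd.DataFrame({
    "Video_ID": [11, 11, 85],
    "Genre": ["Sci-Fi", "Comedy", "Comedy"],
}).drop_duplicates()

conflicts = conflicting_keys(toy_videos, "Video_ID")
print(f"Video_ID unique: {toy_videos['Video_ID'].is_unique}")
print(conflicts)
```

Running the same check on `videos`, `devices`, and `locations` would show whether `Video_ID`, `Device_ID`, and `Location` can serve as primary keys or whether those tables need a composite key.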


---

## Load

### Step 4: Save the Cleaned Tables to CSV

**Files:**

- Save `users_table.csv`, `sessions_table.csv`, `videos_table.csv`, `devices_table.csv`, and `locations_table.csv` to the `cleaned_data` folder for future use.

**Description:**

Export each relational table as a separate CSV file to streamline future analysis and maintain data organization.

In [30]:

import os

# Create folder for cleaned data if it doesn't exist
output_dir = '/Users/joshuastewart/Documents/Streaming Viewership EDA Project/Data/cleaned_data'
os.makedirs(output_dir, exist_ok=True)

# Write each table without the DataFrame index; os.path.join keeps the paths portable
users.to_csv(os.path.join(output_dir, 'users_table.csv'), index=False)
sessions.to_csv(os.path.join(output_dir, 'sessions_table.csv'), index=False)
videos.to_csv(os.path.join(output_dir, 'videos_table.csv'), index=False)
devices.to_csv(os.path.join(output_dir, 'devices_table.csv'), index=False)
locations.to_csv(os.path.join(output_dir, 'locations_table.csv'), index=False)

print("Cleaned data saved to 'cleaned_data' folder.")


Cleaned data saved to 'cleaned_data' folder.
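
A quick way to confirm an export, sketched here under the assumption that matching shape and column names is a sufficient check, is to read each file back and compare it with the in-memory table. The helper `verify_round_trip`, the temporary directory, and the `Device_Type` values are hypothetical and stand in for the real output folder and data.

```python
import os
import tempfile

import pandas as pd

def verify_round_trip(table, path):
    """Write `table` to `path` and check the file reads back with the same shape and columns."""
    table.to_csv(path, index=False)
    reloaded = pd.read_csv(path)
    return reloaded.shape == table.shape and list(reloaded.columns) == list(table.columns)

# Toy Devices table written to a throwaway directory.
toy_devices = pd.DataFrame({"Device_ID": [232, 549], "Device_Type": ["Tablet", "Phone"]})
with tempfile.TemporaryDirectory() as tmp_dir:
    ok = verify_round_trip(toy_devices, os.path.join(tmp_dir, "devices_table.csv"))
print(f"Round trip OK: {ok}")
```

This does not compare cell values, so it catches truncated or mis-headed files but not subtle type changes introduced by CSV serialization.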
