# CSMODEL Project - Netflix Userbase Dataset Case Study

### Group 8
CAPAROS, MIGUEL ANTONIO <br> 
FERRER, ANGEL JUNE <br>
MARTINEZ, AZELIAH <br>
VILLANUEVA, KEISHA LEIGH <br>

# I. Dataset Description

The Netflix Userbase Dataset provides a snapshot of a sample Netflix userbase, showcasing various aspects of user subscriptions, revenue, account details, and activity. Each row represents a unique user, identified by their User ID. The dataset serves as a synthetic representation and does not reflect actual Netflix user data. 


## Data Collection Process

The dataset is synthetically sourced, and as such, any conclusions and insights may not accurately reflect real-world data. 



## Dataset File Structure

Each row in the dataset represents a unique user, and each column contains a specific detail about that user. The dataset contains a total of 2500 observations (rows) and 10 variables (columns), enabling analysis of subscription patterns, revenue generation, and user behavior.



## Dataset Variables

The dataset contains 10 variables, each describing a different attribute of the user:

- User ID: A unique identifier for each user.
- Subscription Type: The type of subscription the user has (basic, standard, or premium).
- Monthly Revenue: The monthly revenue generated from the user's subscription.
- Join Date: The date the user joined Netflix.
- Last Payment Date: The date of the user's last payment.
- Country: The country where the user is located.
- Age: The age of the user.
- Gender: The gender of the user.
- Device: The type of device the user primarily uses to access Netflix (e.g., Smartphone, Tablet, Smart TV, Laptop).
- Plan Duration: The duration of the user's current subscription plan.


# II. Data Cleaning

In this section of the notebook, we will focus on cleaning the Netflix Userbase Dataset. Data cleaning is an essential step in the data analysis process, aimed at preparing raw data for further exploration and analysis. It involves identifying and correcting errors or inconsistencies in the data, handling missing values, removing duplicates, and ensuring data quality and integrity.

## Import Libraries

We begin our data cleaning process by importing **`numpy`** and **`pandas`**, which are essential libraries for data manipulation and analysis in Python.

**`numpy`** is a fundamental package for numerical computing in Python, providing support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays. [**`pandas`**](https://pandas.pydata.org/pandas-docs/stable/index.html) is a Python software library that offers data structures and tools for data analysis.

In [18]:
import numpy as np
import pandas as pd

## The Dataset

Insert description

## Reading the Dataset

Our first step is to load the dataset using pandas, which will import the data into a pandas `DataFrame`. We use the [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function to accomplish this.

In [19]:
userbase_df = pd.read_csv('Netflix Userbase.csv')

When loading a new dataset, it is advisable to utilize the [`info`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html) function, as it displays general information regarding the dataset's structure and attributes.

In [20]:
userbase_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500 entries, 0 to 2499
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   User ID            2500 non-null   int64 
 1   Subscription Type  2500 non-null   object
 2   Monthly Revenue    2500 non-null   int64 
 3   Join Date          2500 non-null   object
 4   Last Payment Date  2500 non-null   object
 5   Country            2500 non-null   object
 6   Age                2500 non-null   int64 
 7   Gender             2500 non-null   object
 8   Device             2500 non-null   object
 9   Plan Duration      2500 non-null   object
dtypes: int64(3), object(7)
memory usage: 195.4+ KB


We will use the [`head`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) function to quickly view the first few rows of our dataset.

In [21]:
userbase_df.head()

|   | User ID | Subscription Type | Monthly Revenue | Join Date | Last Payment Date | Country | Age | Gender | Device | Plan Duration |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Basic | 10 | 15-01-22 | 10-06-23 | United States | 28 | Male | Smartphone | 1 Month |
| 1 | 2 | Premium | 15 | 05-09-21 | 22-06-23 | Canada | 35 | Female | Tablet | 1 Month |
| 2 | 3 | Standard | 12 | 28-02-23 | 27-06-23 | United Kingdom | 42 | Male | Smart TV | 1 Month |
| 3 | 4 | Standard | 12 | 10-07-22 | 26-06-23 | Australia | 51 | Female | Laptop | 1 Month |
| 4 | 5 | Basic | 10 | 01-05-23 | 28-06-23 | Germany | 33 | Male | Smartphone | 1 Month |


## Handling Missing Data

Detecting and managing missing values is crucial for data analysis. To identify missing data within our DataFrame, we will use the [`isnull`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isnull.html) function in combination with [`sum`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sum.html). This approach counts the missing values in each column, which helps us choose appropriate strategies for data cleaning and preprocessing.

In [22]:
missing_data = userbase_df.isnull().sum()
print("Missing data:\n", missing_data)

Missing data:
 User ID              0
Subscription Type    0
Monthly Revenue      0
Join Date            0
Last Payment Date    0
Country              0
Age                  0
Gender               0
Device               0
Plan Duration        0
dtype: int64
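
Note that `isnull` only flags `NaN`/`None` entries. As a complementary check, the sketch below counts text entries that are empty or contain only whitespace. The sample frame and the names `sample_df`, `text_cols`, and `blank_counts` are hypothetical and used only for illustration; they are not rows from the dataset.

```python
import pandas as pd

# Hypothetical sample for illustration; not rows from the actual dataset.
sample_df = pd.DataFrame({
    "Country": ["Canada", "  ", "Spain"],
    "Device": ["Tablet", "Laptop", ""],
    "Age": [35, 51, 33],
})

# isnull() reports no missing values here, because blank strings are not NaN.
null_counts = sample_df.isnull().sum()

# Count entries in the text columns that are empty or whitespace-only.
text_cols = ["Country", "Device"]
blank_counts = sample_df[text_cols].apply(lambda col: col.str.strip().eq("").sum())

print("Null counts:\n", null_counts)
print("Blank-string counts:\n", blank_counts)
```

If a check like this reported blanks in the real data, those entries could be treated the same way as missing values.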


## Outlier Handling

Insert description

## Feature Engineering

Insert description

## Duplicate Handling

Duplicates in datasets can impact the accuracy of analysis results and should be managed to ensure data integrity. We use the [`duplicated`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html) function, which flags rows whose values repeat in the specified columns (`subset`); when no `subset` is given, it compares entire rows. Here we apply it to 'Country' and 'Subscription Type' individually. If any rows are flagged (`not duplicate_rows.empty`), they are displayed for further examination. Otherwise, a message indicating that no duplicates were found is shown. Because both columns are categorical, many users are expected to share the same value, so flagged rows here indicate repeated categories rather than duplicated user records.

The following code identifies and prints the rows whose 'Country' value also appears in at least one other row. Passing `keep=False` to `duplicated()` marks every occurrence of a repeated value, not only the later ones.

In [23]:
duplicate_rows = userbase_df[userbase_df.duplicated(subset='Country', keep=False)]

if not duplicate_rows.empty:
    print("Duplicate rows based on 'Country':")
    print(duplicate_rows)
else:
    print("No duplicate rows found based on 'Country'.")

Duplicate rows based on 'Country':
      User ID Subscription Type  Monthly Revenue Join Date Last Payment Date  \
0           1             Basic               10  15-01-22          10-06-23   
1           2           Premium               15  05-09-21          22-06-23   
2           3          Standard               12  28-02-23          27-06-23   
3           4          Standard               12  10-07-22          26-06-23   
4           5             Basic               10  01-05-23          28-06-23   
...       ...               ...              ...       ...               ...   
2495     2496           Premium               14  25-07-22          12-07-23   
2496     2497             Basic               15  04-08-22          14-07-23   
2497     2498          Standard               12  09-08-22          15-07-23   
2498     2499          Standard               13  12-08-22          12-07-23   
2499     2500             Basic               15  13-08-22          12-07-23   

    

Similarly, this code checks for repeated values in the 'Subscription Type' column. Since each of the three subscription types is shared by more than one user, every row is flagged, so this result reflects the categorical nature of the column rather than duplicated records.

In [24]:
duplicate_rows = userbase_df[userbase_df.duplicated(subset='Subscription Type', keep=False)]

if not duplicate_rows.empty:
    print("Duplicate rows based on 'Subscription Type':")
    print(duplicate_rows)
else:
    print("No duplicate rows found based on 'Subscription Type'.")

Duplicate rows based on 'Subscription Type':
      User ID Subscription Type  Monthly Revenue Join Date Last Payment Date  \
0           1             Basic               10  15-01-22          10-06-23   
1           2           Premium               15  05-09-21          22-06-23   
2           3          Standard               12  28-02-23          27-06-23   
3           4          Standard               12  10-07-22          26-06-23   
4           5             Basic               10  01-05-23          28-06-23   
...       ...               ...              ...       ...               ...   
2495     2496           Premium               14  25-07-22          12-07-23   
2496     2497             Basic               15  04-08-22          14-07-23   
2497     2498          Standard               12  09-08-22          15-07-23   
2498     2499          Standard               13  12-08-22          12-07-23   
2499     2500             Basic               15  13-08-22          12-07-23   
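
The checks above look at one column at a time. To look for duplicated user records instead, one option is to compare entire rows and to test whether 'User ID' is unique. The sketch below is a minimal illustration under stated assumptions: the first three rows of `sample_df` mirror the `head` output shown earlier, while the fourth row is a hypothetical repeat added only so the check has something to find.

```python
import pandas as pd

# The first three rows match the head() output shown earlier; the last row
# is a hypothetical repeat of User ID 1 added for illustration.
sample_df = pd.DataFrame({
    "User ID": [1, 2, 3, 1],
    "Subscription Type": ["Basic", "Premium", "Standard", "Basic"],
    "Country": ["United States", "Canada", "United Kingdom", "United States"],
})

# Rows that repeat an earlier row across every column.
full_duplicates = sample_df[sample_df.duplicated()]

# Rows that share a User ID, which is expected to be unique per user.
id_duplicates = sample_df[sample_df.duplicated(subset="User ID", keep=False)]

print(f"Fully duplicated rows: {len(full_duplicates)}")
print(f"Rows sharing a User ID: {len(id_duplicates)}")

# If true duplicates were found, they could be dropped like this.
deduplicated_df = sample_df.drop_duplicates().reset_index(drop=True)
```

The same calls could be run on `userbase_df`; no result for the full dataset is claimed here.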

## Handling Inconsistent Formatting

Inconsistent formatting in a dataset can cause errors in analysis and lead to misinterpreted results. Different representations of the same data, such as variations in text case ('United States', 'united States', 'UNITED STATES') or inconsistent date formats, can be mistakenly treated as different values. This can distort analysis outcomes and produce inaccurate conclusions. By standardizing data formats, we ensure uniformity, which allows for accurate comparisons and aggregations. This consistency is essential for reliable data analysis and modeling.

To address this, we standardize the 'Country' and 'Subscription Type' columns to lowercase. The [`str.lower`](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.lower.html) method converts every entry in a column to lowercase, ensuring consistency. The [`unique`](https://pandas.pydata.org/docs/reference/api/pandas.Series.unique.html) function is then used to display the unique values in the column after the transformation, allowing us to verify that the standardization was successful.

In [25]:
userbase_df['Country'] = userbase_df['Country'].str.lower()

print("Unique values after converting 'Country' to lowercase:")
print(userbase_df['Country'].unique())
print("\n")

Unique values after converting 'Country' to lowercase:
['united states' 'canada' 'united kingdom' 'australia' 'germany' 'france'
 'brazil' 'mexico' 'spain' 'italy']




Next, we convert all subscription types to lowercase for consistency.

In [26]:
userbase_df['Subscription Type'] = userbase_df['Subscription Type'].str.lower()

print("Unique values after converting 'Subscription Type' to lowercase:")
print(userbase_df['Subscription Type'].unique())
print("\n")

Unique values after converting 'Subscription Type' to lowercase:
['basic' 'premium' 'standard']
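
Date formats were mentioned above as another source of inconsistency, and `info` reports 'Join Date' and 'Last Payment Date' as `object` columns. The sketch below shows one way they might be parsed into datetime values. It assumes the day-month-year layout suggested by the sample rows (for example `15-01-22`), and `sample_df` is a small stand-in built from values in the `head` output rather than the full dataset.

```python
import pandas as pd

# Sample values copied from the head() output shown earlier.
sample_df = pd.DataFrame({
    "Join Date": ["15-01-22", "05-09-21", "28-02-23"],
    "Last Payment Date": ["10-06-23", "22-06-23", "27-06-23"],
})

# Assumed format: day-month-two-digit-year. An explicit format avoids
# pandas guessing month-first for ambiguous values such as 05-09-21.
for col in ["Join Date", "Last Payment Date"]:
    sample_df[col] = pd.to_datetime(sample_df[col], format="%d-%m-%y")

print(sample_df.dtypes)
```

With an explicit `format`, a value that does not match raises an error instead of being silently misread, which makes inconsistent entries easier to spot.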




# III. Exploratory Data Analysis

In this section of the notebook, we aim to explore and understand various aspects of the Netflix Userbase Dataset through exploratory data analysis. We will address the following questions:

1. **Subscription Description**


   *Question: What is the distribution of users across different subscription types?*



2. **Revenue Analysis**


   *Question: How much revenue is generated from each subscription type?*


3. **User Retention**


   *Question:*


4. **Device Type Usage**


   *Question: What is the distribution of device types used by Netflix users?*
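
As a rough sketch of how questions 1, 2, and 4 could be approached, the example below uses `value_counts` and `groupby` on a five-row sample copied from the `head` output shown earlier (with subscription types lowercased). The variable names are illustrative, and the numbers describe only this sample, not the full dataset.

```python
import pandas as pd

# Sample rows copied from the head() output shown earlier (after lowercasing).
sample_df = pd.DataFrame({
    "Subscription Type": ["basic", "premium", "standard", "standard", "basic"],
    "Monthly Revenue": [10, 15, 12, 12, 10],
    "Device": ["Smartphone", "Tablet", "Smart TV", "Laptop", "Smartphone"],
})

# Question 1: number of users per subscription type.
users_per_type = sample_df["Subscription Type"].value_counts()

# Question 2: total monthly revenue per subscription type.
revenue_per_type = sample_df.groupby("Subscription Type")["Monthly Revenue"].sum()

# Question 4: share of users per device.
device_share = sample_df["Device"].value_counts(normalize=True)

print(users_per_type)
print(revenue_per_type)
print(device_share)
```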

# IV. Research Question

The research question drives the focus of our data analysis project and should emerge from insights gained during exploratory data analysis (EDA).

- **Are there any patterns in subscription type preferences based on country?**

This research question examines whether distinct patterns exist in Netflix subscription type preferences (Basic, Standard, Premium) across different countries. Understanding these patterns could help Netflix tailor its content offerings and subscription plans to regional preferences, potentially improving user satisfaction and retention. Building on the exploratory data analysis (EDA) of the Netflix userbase dataset, this study aims to identify regional trends in subscription choices and to provide actionable recommendations for global market expansion and customer engagement strategies.
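
One possible starting point for this question is a contingency table of country against subscription type. The sketch below uses `pd.crosstab` on a hypothetical six-row sample. The values are invented for illustration and are not results from the dataset, and `counts` and `shares` are illustrative names.

```python
import pandas as pd

# Hypothetical sample for illustration; the values are not taken from the dataset.
sample_df = pd.DataFrame({
    "Country": ["canada", "canada", "canada", "spain", "spain", "spain"],
    "Subscription Type": ["basic", "basic", "premium",
                          "standard", "premium", "premium"],
})

# Number of users for each country and subscription type combination.
counts = pd.crosstab(sample_df["Country"], sample_df["Subscription Type"])

# Row-normalized shares, so each country's row sums to 1.
shares = pd.crosstab(
    sample_df["Country"], sample_df["Subscription Type"], normalize="index"
)

print(counts)
print(shares.round(2))
```

Row-normalizing makes countries with different user counts comparable, since each row then shows the mix of subscription types within a country.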