# CSMODEL Project - Netflix Userbase Dataset Case Study

### Group 8
CAPAROS, MIGUEL ANTONIO <br> 
FERRER, ANGEL JUNE <br>
MARTINEZ, AZELIAH <br>
VILLANUEVA, KEISHA LEIGH <br>

# I. Dataset Description

The Netflix Userbase Dataset provides a snapshot of a sample Netflix userbase, showcasing various aspects of user subscriptions, revenue, account details, and activity. Each row represents a unique user, identified by their User ID. The dataset serves as a synthetic representation and does not reflect actual Netflix user data. 


## Data Collection Process

The dataset is synthetically sourced, and as such, any conclusions and insights may not accurately reflect real-world data. 



## Dataset File Structure

Each row in the dataset represents a unique user, and each column contains a specific detail about that user. The dataset contains a total of 2500 observations (rows) and 10 variables (columns), enabling analysis of subscription patterns, revenue generation, and user behavior.

The dataset is stored in a single CSV file, `Netflix Userbase.csv`, so no files need to be combined in the succeeding steps.


## Dataset Variables

The dataset contains 10 variables, each representing a different piece of user information:

- User ID: A unique identifier for each user.
- Subscription Type: The type of subscription the user has (Basic, Standard, or Premium).
- Monthly Revenue: The monthly revenue generated from the user's subscription.
- Join Date: The date the user joined Netflix.
- Last Payment Date: The date of the user's last payment.
- Country: The country where the user is located.
- Age: The age of the user.
- Gender: The gender of the user.
- Device: The type of device the user primarily uses to access Netflix (e.g., Smartphone, Tablet, Smart TV, Laptop).
- Plan Duration: The length of the user's subscription plan (e.g., 1 Month).


# II. Data Cleaning

In this section, we clean the Netflix Userbase Dataset in preparation for analysis.


## Import
Import **numpy** and **pandas**.

[**`pandas`**](https://pandas.pydata.org/pandas-docs/stable/index.html) is a Python software library that offers data structures and tools for data analysis.

In [1]:
import numpy as np
import pandas as pd

## The Dataset

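The dataset used in this notebook is the synthetic Netflix Userbase Dataset described in Section I: 2500 users, each described by 10 variables covering subscription, revenue, account, and device details.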

## Reading the Dataset

Our initial step is to load the dataset using pandas, which will import the data into a pandas `DataFrame`. We use the [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function to accomplish this.

In [2]:
userbase_df = pd.read_csv('Netflix Userbase.csv')

When loading a new dataset, it is advisable to call the [`info`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html) method, which displays general information about the dataset's structure, such as column names, non-null counts, and data types.

In [3]:
userbase_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500 entries, 0 to 2499
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   User ID            2500 non-null   int64 
 1   Subscription Type  2500 non-null   object
 2   Monthly Revenue    2500 non-null   int64 
 3   Join Date          2500 non-null   object
 4   Last Payment Date  2500 non-null   object
 5   Country            2500 non-null   object
 6   Age                2500 non-null   int64 
 7   Gender             2500 non-null   object
 8   Device             2500 non-null   object
 9   Plan Duration      2500 non-null   object
dtypes: int64(3), object(7)
memory usage: 195.4+ KB
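
The `info()` output above reports `Join Date` and `Last Payment Date` as `object` columns, meaning they are stored as plain strings. The following is a minimal sketch of how they could be converted to datetimes, assuming every entry follows the DD-MM-YY pattern visible in the sample rows. The three-row `sample_df` and the `Days Active` column are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical three-row sample mirroring the DD-MM-YY strings in the dataset.
sample_df = pd.DataFrame({
    "Join Date": ["15-01-22", "05-09-21", "28-02-23"],
    "Last Payment Date": ["10-06-23", "22-06-23", "27-06-23"],
})

date_columns = ["Join Date", "Last Payment Date"]
for column in date_columns:
    # An explicit format avoids day/month ambiguity
    # (05-09-21 is read as 5 September 2021, not 9 May 2021).
    sample_df[column] = pd.to_datetime(sample_df[column], format="%d-%m-%y")

# With real datetimes, date arithmetic becomes possible.
# "Days Active" is an illustrative column name, not part of the dataset.
sample_df["Days Active"] = (
    sample_df["Last Payment Date"] - sample_df["Join Date"]
).dt.days

print(sample_df.dtypes)
print(sample_df)
```

The same loop could be applied to `userbase_df` if the format assumption holds for all 2500 rows. By default, `pd.to_datetime` raises an error on a value that does not match the given format, which makes violations of that assumption easy to spot.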


We can use the `head()` method to quickly view the first rows of the dataset (five by default).

In [4]:
userbase_df.head()

| | User ID | Subscription Type | Monthly Revenue | Join Date | Last Payment Date | Country | Age | Gender | Device | Plan Duration |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Basic | 10 | 15-01-22 | 10-06-23 | United States | 28 | Male | Smartphone | 1 Month |
| 1 | 2 | Premium | 15 | 05-09-21 | 22-06-23 | Canada | 35 | Female | Tablet | 1 Month |
| 2 | 3 | Standard | 12 | 28-02-23 | 27-06-23 | United Kingdom | 42 | Male | Smart TV | 1 Month |
| 3 | 4 | Standard | 12 | 10-07-22 | 26-06-23 | Australia | 51 | Female | Laptop | 1 Month |
| 4 | 5 | Basic | 10 | 01-05-23 | 28-06-23 | Germany | 33 | Male | Smartphone | 1 Month |


## Handling Missing Data

To detect missing (null) values in the DataFrame, we can use `isnull()` in combination with `sum()`, which counts the null entries in each column.

In [5]:
missing_data = userbase_df.isnull().sum()
print("Missing data:\n", missing_data)

Missing data:
 User ID              0
Subscription Type    0
Monthly Revenue      0
Join Date            0
Last Payment Date    0
Country              0
Age                  0
Gender               0
Device               0
Plan Duration        0
dtype: int64
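
The dataset has no missing values, so the check above returns zeros throughout. To show what the same check reports when values are missing, here is a small sketch on a hypothetical four-row DataFrame with two gaps injected on purpose. `sample_df` and its contents are assumptions made for illustration, not data from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with two missing values injected for illustration.
sample_df = pd.DataFrame({
    "User ID": [1, 2, 3, 4],
    "Age": [28, np.nan, 42, 51],
    "Country": ["United States", "Canada", None, "Australia"],
})

# Count of missing values per column.
missing_per_column = sample_df.isnull().sum()

# Rows that contain at least one missing value.
rows_with_missing = sample_df[sample_df.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```

Listing the affected rows, rather than only counting them, can help in deciding whether to drop or impute them.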


## Outlier Handling

## Feature Engineering

## Duplicate Handling

In [9]:
duplicate_rows = userbase_df[userbase_df.duplicated(subset='Country', keep=False)]

if not duplicate_rows.empty:
    print("Duplicate rows based on 'Country':")
    print(duplicate_rows)
else:
    print("No duplicate rows found based on 'Country'.")

Duplicate rows based on 'Country':
      User ID Subscription Type  Monthly Revenue Join Date Last Payment Date  \
0           1             Basic               10  15-01-22          10-06-23   
1           2           Premium               15  05-09-21          22-06-23   
2           3          Standard               12  28-02-23          27-06-23   
3           4          Standard               12  10-07-22          26-06-23   
4           5             Basic               10  01-05-23          28-06-23   
...       ...               ...              ...       ...               ...   
2495     2496           Premium               14  25-07-22          12-07-23   
2496     2497             Basic               15  04-08-22          14-07-23   
2497     2498          Standard               12  09-08-22          15-07-23   
2498     2499          Standard               13  12-08-22          12-07-23   
2499     2500             Basic               15  13-08-22          12-07-23   

    

In [12]:
duplicate_rows = userbase_df[userbase_df.duplicated(subset='Subscription Type', keep=False)]

if not duplicate_rows.empty:
    print("Duplicate rows based on 'Subscription Type':")
    print(duplicate_rows)
else:
    print("No duplicate rows found based on 'Subscription Type'.")

Duplicate rows based on 'Subscription Type':
      User ID Subscription Type  Monthly Revenue Join Date Last Payment Date  \
0           1             Basic               10  15-01-22          10-06-23   
1           2           Premium               15  05-09-21          22-06-23   
2           3          Standard               12  28-02-23          27-06-23   
3           4          Standard               12  10-07-22          26-06-23   
4           5             Basic               10  01-05-23          28-06-23   
...       ...               ...              ...       ...               ...   
2495     2496           Premium               14  25-07-22          12-07-23   
2496     2497             Basic               15  04-08-22          14-07-23   
2497     2498          Standard               12  09-08-22          15-07-23   
2498     2499          Standard               13  12-08-22          12-07-23   
2499     2500             Basic               15  13-08-22          12-07-2
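
Both checks above flag every row, because 'Country' and 'Subscription Type' are categorical columns with only a few distinct values. With `keep=False`, any row whose value occurs more than once is marked. The sketch below contrasts that behavior with a full-row check and a check on the identifier column, which may be closer to what duplicate detection is meant to find. The four-row `sample_df` is hypothetical.

```python
import pandas as pd

# Hypothetical sample: the last row is an exact copy of the first,
# and the country "Canada" appears in three rows.
sample_df = pd.DataFrame({
    "User ID": [1, 2, 3, 1],
    "Subscription Type": ["Basic", "Premium", "Basic", "Basic"],
    "Country": ["Canada", "Canada", "Spain", "Canada"],
})

# Subset on a categorical column: flags every row whose country
# occurs more than once, even if the users are different.
same_country = sample_df[sample_df.duplicated(subset="Country", keep=False)]

# Full-row check: flags only rows that are identical in every column.
exact_duplicates = sample_df[sample_df.duplicated(keep=False)]

# Identifier check: flags rows that reuse a User ID.
repeated_ids = sample_df[sample_df.duplicated(subset="User ID", keep=False)]

print(len(same_country), len(exact_duplicates), len(repeated_ids))
```

On `userbase_df`, the equivalent calls would be `userbase_df.duplicated(keep=False)` and `userbase_df.duplicated(subset='User ID', keep=False)`.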

## Handling Inconsistent Formatting

Inconsistent formatting in a dataset can cause errors in analysis and lead to misinterpreted results. Different representations of the same data, such as variations in text case ('United States', 'united States', 'UNITED STATES') or inconsistent date formats, can be mistakenly treated as different values. This can distort analysis outcomes and produce inaccurate conclusions. By standardizing data formats, we ensure uniformity, which allows for accurate comparisons and aggregations. This consistency is essential for reliable data analysis and modeling.

To address this, we standardize the 'Country' and 'Subscription Type' columns to lowercase using `str.lower()`. The `unique()` method then displays all distinct values in each column after the transformation, allowing us to verify that the standardization was successful.

In [13]:
userbase_df['Country'] = userbase_df['Country'].str.lower()

print("Unique values after converting 'Country' to lowercase:")
print(userbase_df['Country'].unique())
print("\n")

Unique values after converting 'Country' to lowercase:
['united states' 'canada' 'united kingdom' 'australia' 'germany' 'france'
 'brazil' 'mexico' 'spain' 'italy']




Next, convert all subscription types to lowercase for consistency.

In [17]:
userbase_df['Subscription Type'] = userbase_df['Subscription Type'].str.lower()

print("Unique values after converting 'Subscription Type' to lowercase:")
print(userbase_df['Subscription Type'].unique())
print("\n")

Unique values after converting 'Subscription Type' to lowercase:
['basic' 'premium' 'standard']
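
To illustrate why this standardization matters, the sketch below uses a hypothetical Series containing the case variants mentioned earlier, plus one entry with a trailing space. The `str.strip()` step is an addition not used in the notebook. It is included on the assumption that stray whitespace could cause the same kind of mismatch as inconsistent case.

```python
import pandas as pd

# Hypothetical entries: case and whitespace variants of the same country.
countries = pd.Series(
    ["United States", "united States", "UNITED STATES ", "Canada"]
)

# Before cleaning, each variant counts as a separate value.
distinct_before = countries.nunique()

# strip() removes leading/trailing spaces; lower() unifies the case.
cleaned = countries.str.strip().str.lower()
distinct_after = cleaned.nunique()

print(distinct_before, distinct_after)
print(cleaned.value_counts())
```

Comparing `nunique()` before and after the transformation is a quick way to see whether any variants were actually merged.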




# III. Exploratory Data Analysis

In this section of the notebook, you must fulfill the following:
- Identify at least __4 exploratory data analysis questions__. Properly state the questions in the notebook. Having more than 4 questions is acceptable, especially if this will help in understanding the data better.

Answer the EDA questions using both:
- Numerical Summaries – measures of central tendency, measures of dispersion, and correlation
- Visualization – Appropriate visualization should be used. Each visualization should be accompanied by a brief explanation.

To emphasize, __both numerical summary and visualization__ should be presented for each question.
The whole process should be supported with verbose textual descriptions of your procedures and findings.
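
As a starting point for the numerical summaries named above, the following sketch shows one way to compute central tendency, dispersion, and correlation with pandas. It is an illustration only: `sample_df` retypes the five rows shown by `head()` earlier so that the block is self-contained, and the choice of columns is an assumption, not one of the project's EDA questions.

```python
import pandas as pd

# The five rows displayed by head(), retyped for a self-contained example.
sample_df = pd.DataFrame({
    "Subscription Type": ["basic", "premium", "standard", "standard", "basic"],
    "Monthly Revenue": [10, 15, 12, 12, 10],
    "Age": [28, 35, 42, 51, 33],
})

# Central tendency (mean, median) and dispersion (standard deviation).
revenue_summary = sample_df["Monthly Revenue"].agg(["mean", "median", "std"])

# Mean revenue within each subscription type.
revenue_by_type = sample_df.groupby("Subscription Type")["Monthly Revenue"].mean()

# Pearson correlation between two numeric variables.
age_revenue_corr = sample_df["Age"].corr(sample_df["Monthly Revenue"])

print(revenue_summary)
print(revenue_by_type)
print(age_revenue_corr)
```

With only five rows, these numbers say nothing about the full dataset. The same calls would need to be run on `userbase_df` to obtain meaningful summaries.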
