# Introduction

# Business Understanding

At a time when Netflix faces serious competition from newer streaming services, many of which leverage their network affiliations to claim successful TV shows and grow their subscribers and libraries, Netflix has one competitive edge: ease of use and technical superiority. If Netflix wants to remain a serious competitor in the streaming business, it cannot falter in this crucial respect; it must continually adjust its platforms and respond to the faults that can drive away consumers and subscribers. Additionally, we want to understand which aspects of our platform resonate positively with consumers, so that we can market our product in relevant ways and learn which aspects of it can be developed further to deepen our consumers' appreciation.

We are tasked with addressing this very issue: locating trends among negative user reviews so that we can adjust our product and continue to provide a top-tier streaming service that stands above the rest in technical terms, as well as trends among positive reviews so that we can better understand which aspects of our product are best emphasized in future marketing campaigns. We will pay particularly close attention to reviews produced over the past year or so, to ensure that our interventions are timely and relevant to the issues our clientele face now or faced in the very recent past. Focusing on recent reviews also provides clarity of action: there is little point in locating trends among negative reviews from a few years back if the underlying issues have already been addressed since.

Put another way, we seek to gain a better understanding of the following:
1. **Retention**: why some users are unhappy with Netflix and may cancel their subscriptions.
2. **Marketing Insights**: what people currently appreciate about Netflix, so that it can inform our promotions.

What this requires of us is:
1. To build a **binary classification model** that applies Natural Language Processing (NLP), more specifically **Sentiment Analysis**, to the user reviews and predicts whether each review is positive or negative. This can be done with traditional supervised learning models such as Logistic Regression, Random Forest Classifiers, and Support Vector Machines (SVM).
2. To apply **clustering methods** to the positive and negative groups of reviews separately, to identify trends and themes that will inform the actions we take.

In this way, we will be better equipped to provide solid business recommendations regarding subscriber retention as well as more effective marketing campaigns.

# Data Understanding

The [dataset we will be working with](https://www.kaggle.com/datasets/ashishkumarak/netflix-reviews-playstore-daily-updated?resource=download) was pulled from Kaggle and contains more than 129,000 reviews dating back to 2018, which is 7 years as of the time of writing. This dataset is updated daily, and the data contained within it is up-to-date as of 2 March 2025.

Of the 8 columns contained in this dataset, the following are of particular or potential interest to us, pending further investigation:
1. `content`, which contains the text of the user review. We will use this column for our NLP and Sentiment Analysis.
2. `# score`, which contains a discrete (categorical) numeric rating on a scale of 1-5. This will serve as our Target column, whose Labels we will predict based on the text of the user reviews.
3. `# thumbsUpCount`, which tells us how many 'thumbs up' each user review received from other users. This potentially indicates the relative significance of different reviews, since a higher count suggests the review resonated with other users.
4. `at`, which tells us the date the user review was created. We will filter our data so that we can focus on reviews produced over the past 1-2 years, that is, since 2023.
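The recency filter on `at` could look roughly like the sketch below. It uses only the standard library; the sample rows, the `YYYY-MM-DD HH:MM:SS` timestamp format, and the exact cutoff of 1 January 2023 are assumptions made for illustration and should be checked against the actual data.

```python
from datetime import datetime

# Hypothetical rows standing in for the dataset; the timestamp format is an assumption.
reviews = [
    {"content": "Keeps buffering on my phone", "at": "2022-11-03 08:15:00"},
    {"content": "Great shows, easy to use", "at": "2023-06-21 19:42:10"},
    {"content": "App crashes after the update", "at": "2024-12-30 23:01:55"},
]

# Assumed cutoff: keep reviews created on or after 1 January 2023.
CUTOFF = datetime(2023, 1, 1)


def is_recent(row, cutoff=CUTOFF):
    """Return True if the review's `at` timestamp falls on or after the cutoff."""
    created = datetime.strptime(row["at"], "%Y-%m-%d %H:%M:%S")
    return created >= cutoff


# Keep only the rows that pass the recency check.
recent_reviews = [row for row in reviews if is_recent(row)]
```

Keeping the cutoff in a single named constant makes it easy to widen or narrow the window later without touching the filtering logic.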

### Features 
We will use the text contained within the `content` column to produce features for our classifier by vectorizing the text using **Term Frequency-Inverse Document Frequency (TF-IDF)**. TF-IDF is a useful strategy for determining the relative significance of terms used in the user reviews: it weighs how often a term appears within a review against how rare that term is across all reviews under consideration.
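To make the weighting concrete, here is a minimal plain-Python sketch of one common textbook form of TF-IDF: term frequency (count divided by review length) multiplied by `log(N / df)`. The toy reviews and the whitespace tokenizer are assumptions for illustration. Library implementations differ in detail; scikit-learn's `TfidfVectorizer`, for example, uses a smoothed IDF and normalizes each vector by default, so its numbers will not match these.

```python
import math
from collections import Counter

# Toy reviews, invented for illustration only.
docs = [
    "app keeps crashing",
    "love the app",
    "love the shows",
]


def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


weights = tf_idf(docs)
```

In the first toy review, "crashing" appears in only one of the three documents while "app" appears in two, so "crashing" receives the larger weight even though both occur once in that review.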

### Targets
The `# score` column will serve as our Target column. We will combine the low ratings (1 and 2) into a Negative class and the high ratings (4 and 5) into a Positive class, then encode these classes as binary labels: 0 for Negative, 1 for Positive.

We will disregard ratings of 3 because we want a clear understanding of what makes a review strictly negative vs. strictly positive. A middle-of-the-road review of 3, as insightful as its content may be, would hinder our ability to understand the division between these sentiments.
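The mapping from ratings to binary labels described above can be sketched as follows. The function name `score_to_label` and the sample scores are hypothetical, introduced only for this example.

```python
def score_to_label(score):
    """Map a 1-5 rating to 0 (Negative), 1 (Positive), or None (dropped)."""
    if score in (1, 2):
        return 0  # Negative class
    if score in (4, 5):
        return 1  # Positive class
    return None  # ratings of 3 are excluded from the classifier


# Sample ratings, invented for illustration.
scores = [1, 5, 3, 2, 4]
labels = [score_to_label(s) for s in scores]

# Drop the reviews that have no label (the 3-star ratings).
labeled = [(s, label) for s, label in zip(scores, labels) if label is not None]
```

Returning `None` for a rating of 3 keeps the exclusion explicit, so those rows can be filtered out in one place before training.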

### Class Distribution
Fortunately for us, the distribution of our Negative and Positive classes is relatively balanced, with a slight skew towards Negative.
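A balance check of this kind can be reproduced with a few lines of plain Python. The labels below are toy values, not the actual class counts from the dataset, and `class_shares` is a hypothetical helper name.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of samples belonging to each class label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Toy labels for illustration only: 0 = Negative, 1 = Positive.
toy_labels = [0, 0, 0, 1, 1]
shares = class_shares(toy_labels)
```

Running the same helper on the real binary labels gives the per-class proportions, which is a quick way to confirm whether resampling or class weighting is worth considering.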

## Data Preparation

# Modeling

# Conclusion

## Evaluation

## Limitations

## Recommendations

## Next Steps