# Preliminary Steps

In [13]:
import pandas as pd
import numpy as np
import re
import os
import subprocess

In [14]:
# Download and unzip shivamb/netflix-shows dataset
!kaggle datasets download shivamb/netflix-shows 

Dataset URL: https://www.kaggle.com/datasets/shivamb/netflix-shows
License(s): CC0-1.0
netflix-shows.zip: Skipping, found more recently modified local copy (use --force to force download)


In [15]:
# Only unzip if netflix-shows.zip hasn't been unzipped already
dir_files = os.listdir(os.getcwd())
unzipped_csvs = ['netflix_titles.csv']  # the archive extracts to netflix_titles.csv
if set(dir_files).isdisjoint(set(unzipped_csvs)) and "netflix-shows.zip" in dir_files:
    subprocess.run(["unzip", "-o", "netflix-shows.zip"], check=True)

Archive:  netflix-shows.zip
  inflating: netflix_titles.csv      
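The cell above shells out to `unzip`, which may not be installed on every system. As a hedged alternative, here is a minimal sketch of the same "only extract if needed" check using the standard-library `zipfile` module. The helper name `extract_if_missing` is hypothetical, and the demo builds a throwaway archive so it runs without the Kaggle download.

```python
import os
import tempfile
import zipfile


def extract_if_missing(zip_path, expected_files, dest_dir="."):
    """Extract zip_path into dest_dir unless every expected file already exists.

    Returns the list of expected files that were missing before the call.
    """
    missing = [
        name for name in expected_files
        if not os.path.exists(os.path.join(dest_dir, name))
    ]
    if missing and os.path.exists(zip_path):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(dest_dir)
    return missing


# Demo on a throwaway archive (made-up contents, not the real dataset).
demo_dir = tempfile.mkdtemp()
demo_zip = os.path.join(demo_dir, "demo.zip")
with zipfile.ZipFile(demo_zip, "w") as archive:
    archive.writestr("netflix_titles.csv", "show_id,title\ns1,Example\n")

first_run = extract_if_missing(demo_zip, ["netflix_titles.csv"], demo_dir)
second_run = extract_if_missing(demo_zip, ["netflix_titles.csv"], demo_dir)
print(first_run, second_run)
```

The second call does nothing because the CSV is already present, which is the behavior the notebook cell is aiming for.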


Try reading in the data on your own!

In [25]:
# TODO: Read your data as a pandas dataframe. Save the dataframe to a variable named "netflix"
DATA_PATH = os.path.join(os.getcwd(), "netflix_titles.csv")
netflix = pd.read_csv(DATA_PATH)
netflix

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
...,...,...,...,...,...,...,...,...,...,...,...,...
8802,s8803,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,"November 20, 2019",2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a..."
8803,s8804,TV Show,Zombie Dumb,,,,"July 1, 2019",2018,TV-Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g..."
8804,s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,"November 1, 2019",2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...
8805,s8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,"January 11, 2020",2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero..."


# EDA

We have a much larger dataset than last week. What are all the columns and what do they mean? What does each row mean? Take some time to look through the data and understand what we're working with.

Some useful functions and attributes: describe(), unique(), sort_values(), dtypes, shape, columns, info(), isnull(), value_counts(), and more if you can think of them. Note that dtypes, shape, and columns are attributes, so they take no parentheses. Get used to exploring data like this so you know what you're working with first!
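To make the difference between attributes and methods concrete, here is a small sketch on a made-up three-column frame. The `toy` dataframe and its values are assumptions for illustration only, not rows from the Netflix data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; the values are made up for illustration.
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "release_year": [2020, 2021, 2009, 2020],
    "director": ["A. Director", np.nan, "B. Director", np.nan],
})

print(toy.shape)                      # attribute: (rows, columns)
print(toy.dtypes)                     # attribute: dtype of each column
print(toy["type"].value_counts())     # method: frequency of each category
print(toy["release_year"].unique())   # method: distinct values
print(toy.isnull().sum())             # method chain: nulls per column
print(toy.sort_values("release_year", ascending=False).head(2))
```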

In [30]:
# TODO: Do your EDA here!
print("Shape of dataset:", netflix.shape)

netflix.columns

print(netflix.dtypes)

netflix.info()

display(netflix.head())

print("\nMissing values per column:")
print(netflix.isnull().sum())

print("\nUnique values per column:")
print(netflix.nunique())

Shape of dataset: (8807, 12)
show_id         object
type            object
title           object
director        object
cast            object
country         object
date_added      object
release_year     int64
rating          object
duration        object
listed_in       object
description     object
dtype: object
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...



Missing values per column:
show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64

Unique values per column:
show_id         8807
type               2
title           8807
director        4528
cast            7692
country          748
date_added      1767
release_year      74
rating            17
duration         220
listed_in        514
description     8775
dtype: int64


# Granularity, Scope, and Temporality

Once you have completed your EDA, you should be able to answer each of these broad ideas about the data. If not, you can always do more EDA!

Note: you can jot down some quick notes instead of a long answer to help remind yourself about certain characteristics of this dataframe.

#### Granularity: What information does each row give us? How is each row unique (i.e., what identifies each row? This is called the **"primary key"**)

**Insert answer here**

Each row represents one Netflix title (either a movie or TV show). The unique identifier for each row is the show_id column, which acts as the primary key because it uniquely identifies each entry.

#### Scope: What are some interesting things we could learn from the data (we can find this out by exploring the columns we have)? Do we have to manipulate the data in some way to get what we're interested in?

**Insert answer here**

The scope of this dataset is Netflix titles and their metadata, including information such as title, rating, genre, release year, and cast. This allows us to analyze patterns such as content types, genres, and trends over time. However, the scope is limited because it only includes Netflix titles and does not include viewership data, popularity metrics, or user ratings. We may also need to manipulate the data, such as filtering, handling missing values, or splitting columns, depending on the specific question we want to answer.

#### Temporality: When was the data collected? How often is the data collected (if there is a pattern)? Do we need to adjust for consistency in the dates/times? Do we need to adjust the data type of our time variables?

**Insert answer here**

The dataset includes temporal information through columns such as release_year and date_added. These tell us when content was released and when it was added to Netflix. The data appears to have been collected from Netflix metadata and may be updated periodically. To analyze time trends, the date_added column may need to be converted to a datetime type so we can filter by year, month, or time ranges.
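A minimal sketch of that conversion, assuming the dates follow the "Month D, YYYY" style seen in the preview. The sample values below are made up (including the stray leading space), and `year_added` and `month_added` are hypothetical column names.

```python
import pandas as pd

# Made-up sample in the same "Month D, YYYY" style as date_added.
sample = pd.DataFrame({
    "date_added": ["September 25, 2021", " July 1, 2019", None],
})

# Strip stray whitespace, then parse; errors="coerce" turns
# unparseable or missing values into NaT instead of raising.
sample["date_added"] = pd.to_datetime(
    sample["date_added"].str.strip(), format="%B %d, %Y", errors="coerce"
)

# Once the column is datetime, the .dt accessor exposes year, month, etc.
sample["year_added"] = sample["date_added"].dt.year
sample["month_added"] = sample["date_added"].dt.month
print(sample)
```

After the conversion, filtering by a date range becomes a plain comparison, for example `sample[sample["date_added"] >= "2020-01-01"]`.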

We will be having people share their responses to each of these questions and discuss what everybody found in the data!

# Faithfulness

This is all about deciding whether you can trust the data in the form it came in, or whether you need to make adjustments to do so.

Did you find any strange or inconsistent values? Can you figure out how the data was collected? Are there any duplicate values (in this case, there shouldn't be because each movie is a separate one)?

**Insert answer here**

I didn’t see obvious duplicate rows, and the show_id column looks like a unique identifier for each title, so that suggests each row represents one distinct movie or TV show. Some columns do have missing values though, especially things like director, cast, and country, which makes sense because that information might not always be available or recorded. The data looks like it was collected from a structured database (probably scraped or exported from Netflix metadata), since the formatting is consistent across rows. I didn’t notice weird impossible values, but some columns use text where a more specific data type might be better (like dates stored as strings).
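One way to check for duplicates rather than eyeballing them is sketched below on made-up rows. The `toy` frame is an assumption for illustration; the same calls should work on `netflix`.

```python
import pandas as pd

# Made-up rows: two share a title but have different show_id values.
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "title": ["Zoom", "Zodiac", "Zoom"],
})

key_is_unique = toy["show_id"].is_unique                 # no repeats in the key?
full_duplicates = toy.duplicated().sum()                 # fully identical rows
title_duplicates = toy.duplicated(subset="title").sum()  # repeats in one column
print(key_is_unique, full_duplicates, title_duplicates)
```

Checking a single column with `subset=` can surface near-duplicates that a whole-row comparison misses.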

Possibly one of the most important questions is what to do with default values. These can be NaN, NA, 0, or something else entirely. Based on what we see from this data, what should we do with our missing values, and should we take different steps for different columns? You can check the slides for some examples of what people do with missing data.

**Insert answer here**

We shouldn't treat all missing values the same way, because different columns serve different purposes. Columns like release_year, type, and rating matter for analysis since we might count or compare them. In this dataset release_year and type have no missing values and rating has only 4, so removing or filling those few rows (depending on the analysis) costs little. Missing values in director (2634), cast (825), and country (831) are far more common but don't affect numerical summaries as much, so we could leave them as null or replace them with something like "Unknown". The best approach depends on what question we're trying to answer, which matches what we learned about handling missing data differently depending on context.
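As a sketch of column-specific handling, `fillna` accepts a dictionary that maps column names to fill values, so only the listed columns are touched. The `toy` frame is made up, and using "Unknown" as the label is an assumption rather than a rule.

```python
import numpy as np
import pandas as pd

# Made-up rows with gaps in two different columns.
toy = pd.DataFrame({
    "director": ["A. Director", np.nan, np.nan],
    "rating": ["PG", "TV-MA", np.nan],
    "release_year": [2020, 2021, 2009],
})

# Only "director" is filled; the missing rating is left as null on purpose.
cleaned = toy.fillna({"director": "Unknown"})
print(cleaned)
print(cleaned.isnull().sum())
```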

We'll discuss again afterwards!

# Variable Typing

Just for practice, it's important to be able to identify what type of variable something is just by looking at it. As a refresher, the 4 types are Quantitative Continuous, Quantitative Discrete, Qualitative Ordinal, and Qualitative Nominal.

#### Question: What are the variable types of the following?
a) Director

b) Country

c) Date Added

d) Release Year


**Insert Answer Here**

a) Qualitative Nominal

b) Qualitative Nominal

c) Quantitative Discrete

d) Quantitative Discrete

# Main Exercise

A common, but often misguided, way to handle datasets that have outliers/untrustworthy data is to drop all rows or columns that have missing values.


Find the shape of our dataset before we do any data manipulation and compare it with the shape after you use the dropna function. What do you notice and how many columns or rows were removed from our data?

Tip: You should make a deep copy of a dataframe by using {data name}.copy(deep=True). Give it an appropriate name to reflect the fact that this will be the dataframe that will have all null values dropped without any previous operations performed on it.

In [37]:
# TODO: Find the shape of our original data vs the shape of our data after dropping nulls
print(netflix.shape)
netflix_no_nulls = netflix.copy(deep=True)
netflix_no_nulls = netflix_no_nulls.dropna()
print(netflix_no_nulls.shape)


(8807, 12)
(5332, 12)


Before removing missing values, the dataset had shape (8807, 12). After applying dropna(), it became (5332, 12). This means 3475 rows (about 39% of the data) were removed, while all 12 columns were kept. Many entries contain at least one missing value, especially in columns like director, cast, or country. Because so many rows are lost, dropping all null values may not be the best approach. Instead, it may be better to handle missing values differently depending on the column (for example, filling categorical columns with "Unknown" rather than deleting the entire row).
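A gentler option than a blanket `dropna()` is to drop rows only when a column you actually need is missing, using the `subset` parameter. The sketch below compares the two on a made-up frame. The `toy` data and the choice of `rating` as the required column are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up rows: nulls are spread across two columns.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": [np.nan, "X", np.nan, "Y"],
    "rating": ["PG", np.nan, "R", "TV-MA"],
})

drop_any = toy.dropna()                      # removes rows with a null in any column
drop_rating = toy.dropna(subset=["rating"])  # removes rows only when rating is null

lost_any = len(toy) - len(drop_any)
lost_rating = len(toy) - len(drop_rating)
print(f"dropna(): lost {lost_any} rows; dropna(subset=['rating']): lost {lost_rating}")
```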

You should've found that some of our data was removed after using the dropna function. Let's go back to our original dataset before we used the function.

Try creating a list of the columns that have null values. Feel free to look things up on Stack Overflow, the pandas documentation, etc.

In [41]:
# TODO: Create a list of columns that have null values
cols_with_nulls = netflix.columns[netflix.isnull().any()]
print(cols_with_nulls)


Index(['director', 'cast', 'country', 'date_added', 'rating', 'duration'], dtype='object')
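The result above is a pandas `Index`, which behaves much like a list. If a plain Python list is wanted, or the null counts alongside the names, one possible variant is sketched here on made-up data (the `toy` frame is an assumption).

```python
import numpy as np
import pandas as pd

# Made-up rows with nulls in two of three columns.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "director": [np.nan, "X", np.nan],
    "rating": ["PG", np.nan, "R"],
})

null_counts = toy.isnull().sum()
cols_with_nulls = null_counts[null_counts > 0].index.tolist()  # plain Python list
print(cols_with_nulls)

# Null counts for just those columns, largest first.
print(null_counts[cols_with_nulls].sort_values(ascending=False))
```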


Now that we have the columns where missing values are present, we have a better idea of what the data types of our missing values actually are.

Try filling in the missing director values with an empty string "". You can check if you did this operation correctly by checking the columns that have missing values and replacing the original dataframe name in the code with the name of the copy you should make for this task.

In [None]:
# TODO: Fill in missing director values with an empty string
netflix_copy = netflix.copy(deep=True)

netflix_copy["director"] = netflix_copy["director"].fillna("")

netflix_copy["director"]



0       Kirsten Johnson
1                      
2       Julien Leclercq
3                      
4                      
             ...       
8802      David Fincher
8803                   
8804    Ruben Fleischer
8805       Peter Hewitt
8806        Mozez Singh
Name: director, Length: 8807, dtype: object

Let's check if filling the "director" column will help reduce the number of rows that will be removed when we use the dropna function.

Use the dropna function on the dataframe that is created after filling in the "director" column and find its shape. Compare the shape of this dataframe to the shape of the original dataframe and the dataframe after dropping all null values.

In [56]:
# TODO: Compare size of original data, data when dropping na after replacing missing directors with empty string, and data with just dropna
print(netflix.shape)
print(netflix_copy.dropna().shape)
print(netflix.dropna().shape)

(8807, 12)
(7290, 12)
(5332, 12)


You should notice that the dataframe produced by dropping null values after filling in the "director" column has fewer rows than the original dataframe (7290 vs. 8807), but more rows than the dataframe produced by dropping nulls without any prior operations (5332).

This should serve as a basic exercise in how to approach data that is not 100% clean. Most data in consulting projects and in the real world will force you to balance preserving the original data against getting rid of the unnecessary parts.

### String Manipulation. We'll load in a new dataset for this part.

In [72]:
# TODO: load in the kazanova/sentiment140 data and unzip the file that is downloaded.
import subprocess

subprocess.run(["kaggle", "datasets", "download", "-d", "kazanova/sentiment140"], check=True)

subprocess.run(["unzip", "-o", "sentiment140.zip"], check=True)


Dataset URL: https://www.kaggle.com/datasets/kazanova/sentiment140
License(s): other
sentiment140.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  sentiment140.zip
  inflating: training.1600000.processed.noemoticon.csv  


CompletedProcess(args=['unzip', '-o', 'sentiment140.zip'], returncode=0)

In [73]:
# TODO: Read the data from the csv file that is generated from unzipping as a dataframe.
# You should use the latin-1 encoding for this step. Name the dataframe tweets_data
import pandas as pd

tweets_data = pd.read_csv(
    "training.1600000.processed.noemoticon.csv",
    encoding="latin-1",
    header=None
)

In [74]:
# Some cleanup of the data, this cell should run correctly if you named and read the dataframe properly
col_names = ["Polarity", "Tweet_ID", "Date", "Query", "Username", "Tweet"]

# Assign the column names in place rather than re-reading the 1.6M-row CSV
tweets_data.columns = col_names

In [78]:
tweets = tweets_data.drop(columns=["Query"]).copy()

final_tweets = tweets.copy()  # or just use tweets directly
final_tweets.head()

Unnamed: 0,Polarity,Tweet_ID,Date,Username,Tweet
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,Karoli,"@nationwideclass no, it's not behaving at all...."


#### Now that we have our **final_tweets** dataset, we can start working with it. Let's practice string methods first.

In [79]:
# TODO: Make all of the strings in the "Username" column lowercase
final_tweets["Username"] = final_tweets["Username"].str.lower()
final_tweets.head()

Unnamed: 0,Polarity,Tweet_ID,Date,Username,Tweet
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,_thespecialone_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,ellectf,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,karoli,"@nationwideclass no, it's not behaving at all...."


#### Try making all the entries in the Tweet column uppercase and save this in a final_tweets_upper dataframe.

In [88]:
# TODO: make a copy of final_tweets and make the "Tweet" column all uppercase
final_tweets_upper = final_tweets.copy()
final_tweets_upper["Tweet"] = final_tweets_upper["Tweet"].str.upper()
final_tweets_upper

Unnamed: 0,Polarity,Tweet_ID,Date,Username,Tweet
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,_thespecialone_,"@SWITCHFOOT HTTP://TWITPIC.COM/2Y1ZL - AWWW, T..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,scotthamilton,IS UPSET THAT HE CAN'T UPDATE HIS FACEBOOK BY ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,mattycus,@KENICHAN I DIVED MANY TIMES FOR THE BALL. MAN...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,ellectf,MY WHOLE BODY FEELS ITCHY AND LIKE ITS ON FIRE
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,karoli,"@NATIONWIDECLASS NO, IT'S NOT BEHAVING AT ALL...."
...,...,...,...,...,...
1599995,4,2193601966,Tue Jun 16 08:40:49 PDT 2009,amandamarie1028,JUST WOKE UP. HAVING NO SCHOOL IS THE BEST FEE...
1599996,4,2193601969,Tue Jun 16 08:40:49 PDT 2009,thewdboards,THEWDB.COM - VERY COOL TO HEAR OLD WALT INTERV...
1599997,4,2193601991,Tue Jun 16 08:40:49 PDT 2009,bpbabe,ARE YOU READY FOR YOUR MOJO MAKEOVER? ASK ME F...
1599998,4,2193602064,Tue Jun 16 08:40:49 PDT 2009,tinydiamondz,HAPPY 38TH BIRTHDAY TO MY BOO OF ALLL TIME!!! ...


#### Now let's try to replace something in a string!

In [87]:
# TODO: create a dataframe that keeps only the rows of final_tweets whose Username contains an underscore
tweeters_with_underscores = final_tweets[final_tweets["Username"].str.contains("_", na=False)]
tweeters_with_underscores

Unnamed: 0,Polarity,Tweet_ID,Date,Username,Tweet
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,_thespecialone_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
5,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,joy_wolf,@Kwesidei not the whole crew
19,0,1467813782,Mon Apr 06 22:20:34 PDT 2009,gi_gi_bee,@FakerPattyPattz Oh dear. Were you drinking ou...
36,0,1467817225,Mon Apr 06 22:21:27 PDT 2009,crosland_12,@cocomix04 ill tell ya the story later not a ...
39,0,1467818007,Mon Apr 06 22:21:39 PDT 2009,anthony_nguyen,Bed. Class 8-12. Work 12-3. Gym 3-5 or 6. Then...
...,...,...,...,...,...
1599953,4,2193577592,Tue Jun 16 08:38:51 PDT 2009,chelsealately_,any ideaZ on what to get dad for father's day ...
1599974,4,2193578345,Tue Jun 16 08:38:55 PDT 2009,kristah_diggs,@yrclndstnlvr ahaha nooo you were just away fr...
1599979,4,2193578576,Tue Jun 16 08:38:57 PDT 2009,angel_sammy04,In the garden
1599980,4,2193578679,Tue Jun 16 08:38:56 PDT 2009,puchal_ek,@myheartandmind jo jen by nemuselo zrovna tÃ© ...


Task 2: In the cell below, use .str.replace to replace all the underscores in Username with a space (" ") in the tweeters_with_underscores dataframe. Output the changed dataframe.

Doc: https://pandas.pydata.org/docs/reference/api/pandas.Series.str.replace.html

In [86]:
# TODO: replace all "_" with " " in the "Username" column
tweeters_with_space = final_tweets.copy()
tweeters_with_space["Username"] = tweeters_with_space["Username"].str.replace("_", " ", regex=False)
tweeters_with_space



Unnamed: 0,Polarity,Tweet_ID,Date,Username,Tweet
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,thespecialone,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,ellectf,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,karoli,"@nationwideclass no, it's not behaving at all...."
...,...,...,...,...,...
1599995,4,2193601966,Tue Jun 16 08:40:49 PDT 2009,amandamarie1028,Just woke up. Having no school is the best fee...
1599996,4,2193601969,Tue Jun 16 08:40:49 PDT 2009,thewdboards,TheWDB.com - Very cool to hear old Walt interv...
1599997,4,2193601991,Tue Jun 16 08:40:49 PDT 2009,bpbabe,Are you ready for your MoJo Makeover? Ask me f...
1599998,4,2193602064,Tue Jun 16 08:40:49 PDT 2009,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! ...
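For comparison, here is a hedged sketch of the same replacement applied only to the filtered rows, as the task wording describes. The `toy` usernames are a small made-up sample and `with_underscores` is a hypothetical name.

```python
import pandas as pd

# Small made-up sample of usernames.
toy = pd.DataFrame({"Username": ["_thespecialone_", "scotthamilton", "joy_wolf"]})

# Keep only usernames containing an underscore; .copy() makes the result
# an independent frame so the assignment below does not trigger a warning.
with_underscores = toy[toy["Username"].str.contains("_", regex=False)].copy()

# regex=False treats "_" as a literal character rather than a pattern.
with_underscores["Username"] = with_underscores["Username"].str.replace(
    "_", " ", regex=False
)
print(with_underscores["Username"].tolist())
```

A leading or trailing underscore becomes a leading or trailing space, which `.str.strip()` can remove if that is unwanted.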


# Regex

---
### Part 1 — Does the title start with a number? 

Some Netflix titles start with a number, like `"13 Reasons Why"` or `"21 Jump Street"`.

**Task:** Use `str.contains()` with a regex pattern to find all titles that **start with a digit**. Store the result in a new DataFrame called `number_titles` and print how many there are.

**Hint:** Should be 130


In [90]:
# TODO: write a pattern that matches titles starting with a digit
pattern = r"^\d"
number_titles = netflix[netflix["title"].str.contains(pattern, na=False)]
print(f"Titles starting with a digit: {len(number_titles)}")
number_titles["title"].head(10)

Titles starting with a digit: 130


188                 2 Alone in Paris
323                          30 Rock
324                          44 Cats
404    9to5: The Story of a Movement
438                 2 Weeks in Lagos
558                        6 Bullets
774                         2 Hearts
850                         99 Songs
851                 99 Songs (Tamil)
852                99 Songs (Telugu)
Name: title, dtype: object
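The `^` anchor is what restricts the match to the start of the title. The sketch below contrasts it with an unanchored pattern on a few illustrative strings, which are examples only and not a claim about which titles are in the dataset.

```python
import pandas as pd

# Illustrative strings only.
titles = pd.Series(["13 Reasons Why", "21 Jump Street", "Apollo 13", "Zoom"])

starts_with_digit = titles.str.contains(r"^\d")  # anchored at the start
has_any_digit = titles.str.contains(r"\d")       # digit anywhere in the string
also_starts = titles.str.match(r"\d")            # str.match anchors at the start implicitly

print(starts_with_digit.tolist())
print(has_any_digit.tolist())
print(also_starts.tolist())
```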

---
### Part 2 — Search descriptions for key words using `\w` and `[ ]`

**Task:** Find all titles whose description mentions the word `"love"` **or** `"romance"` (case-insensitive). Count them and print 5 examples.

Then, as a follow-up, find titles whose description contains **a number followed immediately by the word `"day"` or `"days"`** — e.g. `"30 days"` or `"1 day"`.

**Hints:**
- Pass `case=False` to `str.contains()` to ignore capitalisation.
- For the number + day pattern: `\d+` matches the number, `\s` matches the space, `days?` matches `"day"` or `"days"` (the `?` makes the `s` optional).
- Answer should be 826 and 9 respectively


In [91]:
# TODO: descriptions mentioning love or romance (case-insensitive)
pattern_love = r"love|romance"  # <-- your pattern here

love_titles = netflix[netflix["description"].str.contains(pattern_love, case=False, na=False)]
print(f"Titles mentioning love or romance: {len(love_titles)}")
love_titles["title"].head(5)

Titles mentioning love or romance: 826


24                   Jeans
25    Love on the Spectrum
26          Minsara Kanavu
27               Grown Ups
30         Ankahi Kahaniya
Name: title, dtype: object
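Note that `love|romance` is a substring match, so it also hits words such as "beloved" or "glove". The expected count of 826 corresponds to that substring behavior. If whole words are wanted instead, one option is to wrap the alternatives in `\b` word boundaries, sketched here on made-up descriptions (a whole-word pattern would likely give a different count on the real data).

```python
import pandas as pd

# Made-up descriptions for illustration.
descriptions = pd.Series([
    "A story about love and loss.",
    "Her beloved glove goes missing.",
    "A slow-burning romance.",
])

# Substring match: "love" inside "beloved" and "glove" counts as a hit.
substring = descriptions.str.contains(r"love|romance", case=False)

# Whole-word match: \b requires a word boundary on both sides.
# (?:...) is a non-capturing group, so pandas does not warn about match groups.
whole_word = descriptions.str.contains(r"\b(?:love|romance)\b", case=False)

print(substring.tolist(), whole_word.tolist())
```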

In [92]:
# TODO: descriptions mentioning a number followed by "day" or "days"
pattern_days = r"\d+\sdays?" 

days_titles = netflix[netflix["description"].str.contains(pattern_days, case=False, na=False)]
print(f"Titles with 'N day(s)' in description: {len(days_titles)}")
days_titles[["title", "description"]].head(5)

Titles with 'N day(s)' in description: 9


Unnamed: 0,title,description
1565,Just The Way You Are,An overconfident teen bets he can make a homel...
2888,"Hi Bye, Mama!",When the ghost of a woman gains a second chanc...
4829,Sunday's Illness,Decades after being abandoned as a young child...
5031,Forgotten,When his abducted brother returns seemingly a ...
5893,Winter on Fire: Ukraine's Fight for Freedom,"Over 93 days in Ukraine, what started as peace..."
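Beyond testing whether the pattern occurs, `str.extract` can pull out the number itself by adding a capture group around `\d+`. This is a sketch on made-up sentences, and the group name `n_days` is hypothetical.

```python
import pandas as pd

# Made-up sentences; only the first two contain "<number> day(s)".
descriptions = pd.Series([
    "Over 93 days in Ukraine, protests grew.",
    "She has 1 day left to decide.",
    "Thirty days of night.",
])

# The named group captures the digits; rows without a match get a missing value.
# expand=False returns a Series because there is a single capture group.
n_days = descriptions.str.extract(r"(?P<n_days>\d+)\sdays?", expand=False)
print(n_days.tolist())
```

The extracted values are strings, so a numeric conversion such as `pd.to_numeric(n_days)` would be needed before doing arithmetic on them.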
