# Preliminary Steps

In [4]:
import pandas as pd
import numpy as np
import subprocess
import re
import os

In [5]:
# Download the shivamb/netflix-shows dataset (unzipping happens in the next cell)
!kaggle datasets download shivamb/netflix-shows 

Dataset URL: https://www.kaggle.com/datasets/shivamb/netflix-shows
License(s): CC0-1.0
netflix-shows.zip: Skipping, found more recently modified local copy (use --force to force download)


In [6]:
# Only unzip if netflix-shows.zip hasn't been unzipped already
dir_files = os.listdir(os.getcwd())
unzipped_csvs = ['netflix_titles.csv']  # name of the CSV inside the archive
if set(dir_files).isdisjoint(set(unzipped_csvs)) and "netflix-shows.zip" in dir_files:
    subprocess.run(["unzip", "-o", "netflix-shows.zip"], check=True)

Archive:  netflix-shows.zip
  inflating: netflix_titles.csv      


Try reading in the data on your own!

In [9]:
# TODO: Read your data as a pandas dataframe. Save the dataframe to a variable named "netflix"
DATA_PATH = os.path.join(os.getcwd(), "netflix_titles.csv")
netflix = DATA_PATH  # still a path string here; the dataframe is loaded with pd.read_csv(netflix) in the EDA cell
netflix

'/Users/noahpadecky/Desktop/JCP26Notebooks/netflix_titles.csv'

# EDA

We have a much larger dataset than last week. What are all the columns and what do they mean? What does each row mean? Take some time to look through the data and understand what we're working with.

Some useful functions and attributes: describe(), unique(), sort_values(), dtypes, shape, columns, info(), isnull(), value_counts(), and more if you can think of them. Note that dtypes, shape, and columns are attributes, so they take no parentheses. Get used to exploring data like this so you know what you're working with first!
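As a quick illustration of the tools listed above, here is a minimal sketch on a small made-up dataframe. The `toy` table and its values are hypothetical stand-ins, not rows from the Netflix file.

```python
import pandas as pd

# Hypothetical toy data standing in for the real table
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "release_year": [2020, 2021, 2007, 2009],
    "rating": ["PG-13", "TV-MA", "R", None],
})

print(toy.shape)                        # (rows, columns); attribute, no parentheses
print(toy.columns.tolist())             # column names
print(toy.dtypes)                       # dtype of each column; also an attribute
print(toy["type"].value_counts())       # frequency of each category
print(toy["rating"].unique())           # distinct values, including missing ones
print(toy.isnull().sum())               # number of missing values per column
print(toy.sort_values("release_year").head(2))  # oldest titles first
print(toy["release_year"].describe())   # summary statistics for a numeric column
```

Running the same calls on the real dataframe is usually enough to answer the granularity, scope, and temporality questions later in the notebook.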

In [21]:
# TODO: EDA notes: Working with a CSV file with both quantitative and categorical variables, although categorical variables are far more prevalent. As far as temporality goes, the movies/TV shows are generally modern in terms of release date.
df = pd.read_csv(netflix)
df.dtypes

show_id           str
type              str
title             str
director          str
cast              str
country           str
date_added        str
release_year    int64
rating            str
duration          str
listed_in         str
description       str
dtype: object

In [28]:
df[df['release_year'] > 1925]

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
...,...,...,...,...,...,...,...,...,...,...,...,...
8802,s8803,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,"November 20, 2019",2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a..."
8803,s8804,TV Show,Zombie Dumb,,,,"July 1, 2019",2018,TV-Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g..."
8804,s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,"November 1, 2019",2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...
8805,s8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,"January 11, 2020",2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero..."


In [29]:
df.loc[:, ["date_added", "release_year"]]

Unnamed: 0,date_added,release_year
0,"September 25, 2021",2020
1,"September 24, 2021",2021
2,"September 24, 2021",2021
3,"September 24, 2021",2021
4,"September 24, 2021",2021
...,...,...
8802,"November 20, 2019",2007
8803,"July 1, 2019",2018
8804,"November 1, 2019",2009
8805,"January 11, 2020",2006


In [30]:
df[df["country"] == "United States"]

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
9,s10,Movie,The Starling,Theodore Melfi,"Melissa McCarthy, Chris O'Dowd, Kevin Kline, T...",United States,"September 24, 2021",2021,PG-13,104 min,"Comedies, Dramas",A woman adjusting to life after a loss contend...
15,s16,TV Show,Dear White People,,"Logan Browning, Brandon P. Bell, DeRon Horton,...",United States,"September 22, 2021",2021,TV-MA,4 Seasons,"TV Comedies, TV Dramas",Students of color navigate the daily slights a...
27,s28,Movie,Grown Ups,Dennis Dugan,"Adam Sandler, Kevin James, Chris Rock, David S...",United States,"September 20, 2021",2010,PG-13,103 min,Comedies,Mourning the loss of their beloved junior high...
28,s29,Movie,Dark Skies,Scott Stewart,"Keri Russell, Josh Hamilton, J.K. Simmons, Dak...",United States,"September 19, 2021",2013,PG-13,97 min,"Horror Movies, Sci-Fi & Fantasy",A family’s idyllic suburban life shatters when...
...,...,...,...,...,...,...,...,...,...,...,...,...
8791,s8792,Movie,Young Adult,Jason Reitman,"Charlize Theron, Patton Oswalt, Patrick Wilson...",United States,"November 20, 2019",2011,R,94 min,"Comedies, Dramas, Independent Movies",When a divorced writer gets a letter from an o...
8793,s8794,Movie,"Yours, Mine and Ours",Raja Gosnell,"Dennis Quaid, Rene Russo, Sean Faris, Katija P...",United States,"November 20, 2019",2005,PG,88 min,"Children & Family Movies, Comedies",When a father of eight and a mother of 10 prep...
8802,s8803,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,"November 20, 2019",2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a..."
8804,s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,"November 1, 2019",2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...


In [31]:
df[df["rating"] == "PG"]

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
6,s7,Movie,My Little Pony: A New Generation,"Robert Cullen, José Luis Ucha","Vanessa Hudgens, Kimiko Glenn, James Marsden, ...",,"September 24, 2021",2021,PG,91 min,Children & Family Movies,Equestria's divided. But a bright-eyed hero be...
41,s42,Movie,Jaws,Steven Spielberg,"Roy Scheider, Robert Shaw, Richard Dreyfuss, L...",United States,"September 16, 2021",1975,PG,124 min,"Action & Adventure, Classic Movies, Dramas",When an insatiable great white shark terrorize...
42,s43,Movie,Jaws 2,Jeannot Szwarc,"Roy Scheider, Lorraine Gary, Murray Hamilton, ...",United States,"September 16, 2021",1978,PG,116 min,"Dramas, Horror Movies, Thrillers",Four years after the last deadly shark attacks...
43,s44,Movie,Jaws 3,Joe Alves,"Dennis Quaid, Bess Armstrong, Simon MacCorkind...",United States,"September 16, 2021",1983,PG,98 min,"Action & Adventure, Horror Movies, Thrillers",After the staff of a marine theme park try to ...
45,s46,Movie,My Heroes Were Cowboys,Tyler Greco,,,"September 16, 2021",2021,PG,23 min,Documentaries,Robin Wiltshire's painful childhood was rescue...
...,...,...,...,...,...,...,...,...,...,...,...,...
8655,s8656,Movie,Unaccompanied Minors,Paul Feig,"Lewis Black, Wilmer Valderrama, Tyler James Wi...",United States,"October 1, 2019",2006,PG,90 min,"Children & Family Movies, Comedies","Five disparate kids, snowed in at the airport ..."
8701,s8702,Movie,Water & Power: A California Heist,Marina Zenovich,,United States,"February 1, 2018",2017,PG,78 min,Documentaries,California residents and farmers face powerful...
8776,s8777,Movie,Yellowbird,Christian De Vita,"Dakota Fanning, Seth Green, Christine Baranski...","France, Belgium","August 5, 2015",2014,PG,90 min,"Children & Family Movies, Comedies",An orphaned bird tags along with a flock on th...
8793,s8794,Movie,"Yours, Mine and Ours",Raja Gosnell,"Dennis Quaid, Rene Russo, Sean Faris, Katija P...",United States,"November 20, 2019",2005,PG,88 min,"Children & Family Movies, Comedies",When a father of eight and a mother of 10 prep...


In [33]:
df["title"].value_counts()

title
Dick Johnson Is Dead     1
Blood & Water            1
Ganglands                1
Jailbirds New Orleans    1
Kota Factory             1
                        ..
Zodiac                   1
Zombie Dumb              1
Zombieland               1
Zoom                     1
Zubaan                   1
Name: count, Length: 8807, dtype: int64

# Granularity, Scope, and Temporality

Once you have completed your EDA, you should be able to answer each of these broad ideas about the data. If not, you can always do more EDA!

*Note, you can jot down some quick notes instead of a long answer to help remind yourself about certain characteristics of this dataframe

#### Granularity: What information does each row give us? How is each row unique (i.e., what identifies each row? This is called the **"primary key"**)

Each row describes a single TV show or movie and is identified by a unique show_id, which serves as the primary key.
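One way to check whether a column can act as a primary key is to test it for uniqueness. The sketch below uses a hypothetical three-row table; on the real data the same check would be run on `df["show_id"]`.

```python
import pandas as pd

# Hypothetical rows: ids are unique, but one title repeats
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "title": ["Zodiac", "Zoom", "Zoom"],
})

id_is_key = toy["show_id"].is_unique    # True when no value repeats
title_is_key = toy["title"].is_unique   # False here because "Zoom" appears twice
print(id_is_key, title_is_key)
```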

#### Scope: What are some interesting things we could learn from the data (we can find this out by exploring the columns we have)? Do we have to manipulate the data in some way to get what we're interested in?

I would specifically be interested in how certain metrics differ between movies and TV shows. For example, we could compare the ratings of the two categories as a whole, or how the number of cast members varies between them.

#### Temporality: When was the data collected? How often is the data collected (if there is a pattern)? Do we need to adjust for consistency in the dates/times? Do we need to adjust the data type of our time variables?

As noted in the EDA, most of the movies and shows were released recently, so a deeper analysis might need to segment by specific years, 2021 in particular. The time variables other than release_year (that is, date_added, which is stored as a string) may need to be converted to a numeric or datetime type.
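One possible way to adjust the type of a date column stored as text is `pd.to_datetime`. This is a minimal sketch on made-up strings in the same "Month day, year" style as date_added; the leading space in the second value is an assumption about the kind of inconsistency that can show up, not something confirmed above.

```python
import pandas as pd

# Hypothetical values shaped like the date_added column
dates = pd.Series(["September 25, 2021", " July 1, 2019", None])

# Strip stray whitespace, then parse; values that cannot be parsed become NaT
parsed = pd.to_datetime(dates.str.strip(), format="%B %d, %Y", errors="coerce")

print(parsed)
print(parsed.dt.year)  # year component, handy for grouping by year added
```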

We will be having people share their responses to each of these questions and discuss what everybody found in the data!

# Faithfulness

This is all about deciding whether you can trust the data in the form it came in, or whether you need to make adjustments to do so.

Did you find any strange or inconsistent values? Can you figure out how the data was collected? Are there any duplicate values (in this case, there shouldn't be because each movie is a separate one)?

Noticeably, some rows have NaN values for the cast, director, and a few other variables. There seem to be no duplicates, as each title value in the dataset is unique.

In [34]:
df.info()


<class 'pandas.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   show_id       8807 non-null   str  
 1   type          8807 non-null   str  
 2   title         8807 non-null   str  
 3   director      6173 non-null   str  
 4   cast          7982 non-null   str  
 5   country       7976 non-null   str  
 6   date_added    8797 non-null   str  
 7   release_year  8807 non-null   int64
 8   rating        8803 non-null   str  
 9   duration      8804 non-null   str  
 10  listed_in     8807 non-null   str  
 11  description   8807 non-null   str  
dtypes: int64(1), str(11)
memory usage: 825.8 KB


Possibly one of the most important questions is what to do with default values. These can be NaN, NA, 0, or something else entirely. Based on what we see from this data, what should we do with our missing values, and should we take different steps for different columns? You can check the slides for some examples of what people do with missing data.

Most of the null occurrences are in the director, cast, and country columns, and because those are categorical variables, imputing an average is not meaningful. With that in mind, I think the best option is to replace the missing values with a 0. Many shows have no director because production is led by the producer team, and animated shows may not have a designated "cast". It therefore seems reasonable to treat those entries as having a 0 in the respective columns.
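As a point of comparison for the discussion, here is a sketch of filling different columns with different values in one call by passing a dictionary to `fillna`. The `"Unknown"` placeholder is an assumption chosen for illustration; a string placeholder keeps each text column a single type, whereas a numeric 0 would mix numbers into text columns.

```python
import pandas as pd

# Hypothetical rows with gaps in the categorical columns
toy = pd.DataFrame({
    "director": ["David Fincher", None],
    "cast": [None, "Tim Allen"],
    "country": ["United States", None],
})

# Map each column to its own fill value
filled = toy.fillna({"director": "Unknown", "cast": "Unknown", "country": "Unknown"})
print(filled)
```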

We'll discuss again afterwards!

# Variable Typing

Just for practice, it's important to be able to identify what type of variable something is just by looking at it. As a refresher, the 4 types are Quantitative Continuous, Quantitative Discrete, Qualitative Ordinal, and Qualitative Nominal.

#### Question: What are the variable types of
a) Director

b) Country

c) Date Added

d) Release Year


a) Qualitative nominal

b) Qualitative nominal

c) Quantitative discrete

d) Quantitative discrete

# Main Exercise

A common, but often misguided, way to handle datasets that have outliers/untrustworthy data is to drop all rows or columns that have missing values.


Find the shape of our dataset before we do any data manipulation and compare it with the shape after you use the dropna function. What do you notice and how many columns or rows were removed from our data?

Tip: You should make a deep copy of a dataframe by using {data name}.copy(deep=True). Give it an appropriate name to reflect the fact that this will be the dataframe that will have all null values dropped without any previous operations performed on it.
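To see why the copy matters, here is a small sketch with a hypothetical one-column dataframe: changes made to the copy leave the original untouched.

```python
import pandas as pd

original = pd.DataFrame({"director": ["Kirsten Johnson", None]})

# Work on an independent copy so the original keeps its missing value
working = original.copy(deep=True)
working["director"] = working["director"].fillna("")

print(original["director"].isnull().sum())  # missing values left in the original
print(working["director"].isnull().sum())   # missing values left in the copy
```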

In [52]:
dropped = df.copy(deep=True)
dropped = dropped.dropna()
dropped.shape

(5332, 12)

You should've found that some of our data was removed after using the dropna function. Let's go back to our original dataset before we used the function.

Try creating a list of the columns that have null values. Feel free to search things up on StackOverflow, the pandas documentation, etc.

In [54]:
null_colls = df.columns[df.isnull().any()]
print(null_colls)

Index(['director', 'cast', 'country', 'date_added', 'rating', 'duration'], dtype='str')


Now that we have the columns where missing values are present, we have a better idea of what the data types of our missing values actually are.

Try filling in the missing director values with an empty string "". You can check if you did this operation correctly by checking the columns that have missing values and replacing the original dataframe name in the code with the name of the copy you should make for this task.

In [56]:
# TODO: Fill in missing director values with an empty string
director_fill = df.copy(deep=True)
director_fill["director"] = director_fill["director"].fillna("")


Let's check if filling the "director" column will help reduce the number of rows that will be removed when we use the dropna function.

Use the dropna function on the dataframe that is created after filling in the "director" column and find its shape. Compare the shape of this dataframe to the shape of the original dataframe and the dataframe after dropping all null values.

In [60]:
# TODO: Compare size of original data, data when dropping na after replacing missing directors with empty string, and data with just dropna
new = director_fill.dropna()
print(new.shape)
print(df.shape)
print(dropped.shape)

(7290, 12)
(8807, 12)
(5332, 12)


You should notice that the dataframe created when we drop null values after filling in the "director" column has fewer rows than the original dataframe, but more than the dataframe created when we just drop all null values without performing any other operations.

This should serve as a basic exercise on how you can approach data that is not 100% clean. Most data in consulting projects and in the real world will force you to find ways to balance preserving the original data with getting rid of the unnecessary parts of the data.
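Another way to strike that balance, sketched here on hypothetical data, is to drop rows only when specific columns are missing by using the `subset` parameter of `dropna`. Which columns count as essential is an assumption that depends on the analysis.

```python
import pandas as pd

# Hypothetical rows: titles A and C lack a director, title B lacks a rating
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": [None, "X", None, "Y"],
    "rating": ["PG", None, "R", "PG"],
})

all_dropped = toy.dropna()                   # removes every row with any null
rating_only = toy.dropna(subset=["rating"])  # removes only rows missing a rating

print(all_dropped.shape, rating_only.shape)
```

In this toy case the blanket `dropna()` keeps one row, while the targeted version keeps three.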

### String Manipulation. We'll load in a new dataset for this part.

In [63]:
# TODO: load in the kazanova/sentiment140 data and unzip the file that is downloaded.
!kaggle datasets download kazanova/sentiment140


Dataset URL: https://www.kaggle.com/datasets/kazanova/sentiment140
License(s): other
Downloading sentiment140.zip to /Users/noahpadecky/Desktop/JCP26Notebooks
100%|██████████████████████████████████████| 80.9M/80.9M [00:03<00:00, 22.4MB/s]



In [None]:
# TODO: Read the data from the csv file that is generated from unzipping as a dataframe.
# You should use the latin-1 encoding for this step. Name the dataframe tweets_data

dir_files = os.listdir(os.getcwd())
unzipped_csvs = ['training.1600000.processed.noemoticon.csv']
if set(dir_files).isdisjoint(set(unzipped_csvs)) and "sentiment140.zip" in dir_files:
    subprocess.run(["unzip", "-o", "sentiment140.zip"], check=True)


fill_in = os.path.join(os.getcwd(), "training.1600000.processed.noemoticon.csv")
tweets_data = pd.read_csv(fill_in, encoding="latin-1", header=None)
tweets_data.columns

Index([0, 1, 2, 3, 4, 5], dtype='int64')

In [88]:
# Some cleanup of the data, this cell should run correctly if you named and read the dataframe properly
tweets = tweets_data.drop(columns=[3]).rename(columns={
    0: "Polarity",
    1: "id",
    2: "Date",
    4: "Username",
    5: "tweets",
})
tweets.head(5)

Unnamed: 0,Polarity,id,Date,Username,tweets
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,Karoli,"@nationwideclass no, it's not behaving at all...."


#### Now that we have our **final_tweets** dataset, we can start working with it. Let's practice string methods first.

In [92]:
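# TODO: Make all of the strings in the "Username" column lowercase
tweets_data.columns = ["Polarity", "id", "Date", "NO_QUERY", "Username", "Tweet"]
tweets_data["Username"] = tweets_data["Username"].str.lower()
# Define final_tweets here so the cells below can run top to bottom
final_tweets = tweets_data.drop(columns=["NO_QUERY"])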


#### Try making all the entries in the Tweet column uppercase and save this in a final_tweets_upper dataframe.

In [93]:
# TODO: make a copy of final_tweets and make the "Tweet" column all uppercase
final_tweets_upper = final_tweets.copy(deep=True)
final_tweets_upper["Tweet"] = final_tweets_upper["Tweet"].str.upper()

#### Now let's try to replace something in a string!

In [94]:
# TODO: create a dataframe that keeps only the tweets in final_tweets that contain an underscore
tweets_with_underscore = final_tweets[final_tweets["Tweet"].str.contains("_", na=False)]

Task 2: In the cell below, use .str.replace to replace all the underscores in Username with a space (" ") in the tweets_with_underscore dataframe. Output the changed tweets_with_underscore dataframe.

Doc: https://pandas.pydata.org/docs/reference/api/pandas.Series.str.replace.html
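Before applying this to the full tweets data, it can help to try `.str.replace` on a handful of values. The usernames below are hypothetical examples. Passing `regex=False` treats `"_"` as a literal character rather than a pattern.

```python
import pandas as pd

# Hypothetical usernames, some containing underscores
usernames = pd.Series(["_thespecialone_", "scotthamilton", "mary_jane"])

# Replace every literal underscore with a space
cleaned = usernames.str.replace("_", " ", regex=False)
print(cleaned.tolist())
```

Note that `.str.replace` returns a new Series, so the result has to be assigned back to the column for the change to stick.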

In [98]:
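# TODO: replace all "_" with " " in the "Username" column
tweets_data.columns = ["Polarity", "id", "Date", "NO_QUERY", "Username", "Tweet"]
final_tweets = tweets_data.drop(columns=["NO_QUERY"])
# .copy() so the replacement does not act on a filtered view of final_tweets
tweets_with_underscore = final_tweets[final_tweets["Tweet"].str.contains("_", na=False)].copy()
tweets_with_underscore["Username"] = tweets_with_underscore["Username"].str.replace("_", " ", regex=False)
tweets_with_underscore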

# Regex

---
### Part 1 — Does the title start with a number? 

Some Netflix titles start with a number, like `"13 Reasons Why"` or `"21 Jump Street"`.

**Task:** Use `str.contains()` with a regex pattern to find all titles that **start with a digit**. Store the result in a new DataFrame called `number_titles` and print how many there are.

**Hint:** Should be 130
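To see what the start-of-string anchor changes, here is a sketch on three made-up titles. `^` pins the match to the beginning of the string, so a digit later in the title no longer counts.

```python
import pandas as pd

# Hypothetical titles: two start with a digit, one only contains digits later
titles = pd.Series(["13 Reasons Why", "Apollo 13", "9to5"])

starts_with_digit = titles.str.contains(r"^\d", na=False)  # digit must be first
contains_digit = titles.str.contains(r"\d", na=False)      # digit anywhere

print(starts_with_digit.tolist())
print(contains_digit.tolist())
```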


In [101]:
# TODO: write a pattern that matches titles starting with a digit
import pandas as pd

netflix = pd.read_csv("netflix_titles.csv")
pattern = r"^\d"
number_titles = netflix[netflix["title"].str.contains(pattern, na=False)]
print(f"Titles starting with a digit: {len(number_titles)}")
number_titles["title"].head(10)

Titles starting with a digit: 130


188                 2 Alone in Paris
323                          30 Rock
324                          44 Cats
404    9to5: The Story of a Movement
438                 2 Weeks in Lagos
558                        6 Bullets
774                         2 Hearts
850                         99 Songs
851                 99 Songs (Tamil)
852                99 Songs (Telugu)
Name: title, dtype: str

---
### Part 2 — Search descriptions for key words using `\w` and `[ ]`

**Task:** Find all titles whose description mentions the word `"love"` **or** `"romance"` (case-insensitive). Count them and print 5 examples.

Then, as a follow-up, find titles whose description contains **a number followed immediately by the word `"day"` or `"days"`** — e.g. `"30 days"` or `"1 day"`.

**Hints:**
- Pass `case=False` to `str.contains()` to ignore capitalisation.
- For the number + day pattern: `\d+` matches the number, `\s` matches the space, `days?` matches `"day"` or `"days"` (the `?` makes the `s` optional).
- Answer should be 826 and 9 respectively
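The hints above can be tried on a few made-up descriptions first. One thing worth noticing in this sketch: `love|romance` matches substrings, so a word like "beloved" also counts. The `\b` word-boundary version is shown only as an optional variation, not as the expected answer to the task.

```python
import pandas as pd

# Hypothetical descriptions for experimenting with the patterns
descriptions = pd.Series([
    "Over 93 days in Ukraine, protests grew.",
    "A LOVE story set in Paris.",
    "One day at a time.",
    "He has 1 day to escape.",
    "Mourning their beloved coach.",
])

# Case-insensitive substring match on either word
has_love = descriptions.str.contains(r"love|romance", case=False, na=False)

# Optional variation: whole words only, using word boundaries
whole_word = descriptions.str.contains(r"\b(?:love|romance)\b", case=False, na=False)

# One or more digits, a whitespace character, then "day" or "days"
has_n_days = descriptions.str.contains(r"\d+\sdays?", na=False)

print(has_love.tolist())
print(whole_word.tolist())
print(has_n_days.tolist())
```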


In [103]:
# TODO: descriptions mentioning love or romance (case-insensitive)
pattern_love = r"love|romance" # <-- your pattern here

love_titles = netflix[netflix["description"].str.contains(pattern_love, case=False, na=False)]
print(f"Titles mentioning love or romance: {len(love_titles)}")
love_titles["title"].head(5)

Titles mentioning love or romance: 826


24                   Jeans
25    Love on the Spectrum
26          Minsara Kanavu
27               Grown Ups
30         Ankahi Kahaniya
Name: title, dtype: str

In [104]:
# TODO: descriptions mentioning a number followed by "day" or "days"
pattern_days = r"\d+\sday[s]?" 

days_titles = netflix[netflix["description"].str.contains(pattern_days, case=False, na=False)]
print(f"Titles with 'N day(s)' in description: {len(days_titles)}")
days_titles[["title", "description"]].head(5)

Titles with 'N day(s)' in description: 9


Unnamed: 0,title,description
1565,Just The Way You Are,An overconfident teen bets he can make a homel...
2888,"Hi Bye, Mama!",When the ghost of a woman gains a second chanc...
4829,Sunday's Illness,Decades after being abandoned as a young child...
5031,Forgotten,When his abducted brother returns seemingly a ...
5893,Winter on Fire: Ukraine's Fight for Freedom,"Over 93 days in Ukraine, what started as peace..."
