# Preliminary Steps

In [5]:
import pandas as pd
import numpy as np
import re
import os
import subprocess


In [6]:
# TODO: Download and unzip shivamb/netflix-shows dataset
!kaggle datasets download shivamb/netflix-shows 

Dataset URL: https://www.kaggle.com/datasets/shivamb/netflix-shows
License(s): CC0-1.0
netflix-shows.zip: Skipping, found more recently modified local copy (use --force to force download)


Try reading in the data on your own!

In [8]:
# Only unzip if netflix-shows.zip hasn't been unzipped already
dir_files = os.listdir(os.getcwd())
unzipped_csvs = ['netflix_titles.csv']
if set(dir_files).isdisjoint(set(unzipped_csvs)) and "netflix-shows.zip" in dir_files:
    subprocess.run(["unzip", "-o", "netflix-shows.zip"], check=True)


Archive:  netflix-shows.zip
  inflating: netflix_titles.csv      


In [9]:
# TODO: Read your data as a pandas dataframe. Save the dataframe to a variable named "netflix"
DATA_PATH = os.path.join(os.getcwd(), "netflix_titles.csv")
netflix = pd.read_csv(DATA_PATH)
netflix

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
...,...,...,...,...,...,...,...,...,...,...,...,...
8802,s8803,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,"November 20, 2019",2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a..."
8803,s8804,TV Show,Zombie Dumb,,,,"July 1, 2019",2018,TV-Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g..."
8804,s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,"November 1, 2019",2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...
8805,s8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,"January 11, 2020",2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero..."


# EDA

We have a much larger dataset than last week. What are all the columns and what do they mean? What does each row mean? Take some time to look through the data and understand what we're working with.

Some useful functions and attributes: describe(), unique(), sort_values(), info(), isnull(), value_counts(), dtypes, shape, columns, and more if you can think of them. Note that dtypes, shape, and columns are attributes, so they are written without parentheses. Get used to exploring data like this so you know what you're working with first!
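
As a quick illustration of the calls listed above, here is a minimal sketch on a three-row toy dataframe. The `toy` frame and its values are stand-ins copied from the preview, not the full dataset, and the variable names are only for illustration.

```python
import pandas as pd

# Toy stand-in for the real data (hypothetical; three rows taken from the preview).
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "type": ["Movie", "TV Show", "TV Show"],
    "country": ["United States", "South Africa", None],
    "release_year": [2020, 2021, 2021],
})

n_rows, n_cols = toy.shape                  # attribute: (rows, columns)
column_types = toy.dtypes                   # attribute: dtype of each column
type_counts = toy["type"].value_counts()    # how often each category appears
missing_per_column = toy.isnull().sum()     # number of missing values per column
newest_first = toy.sort_values("release_year", ascending=False)

print(n_rows, n_cols)
print(type_counts)
print(missing_per_column)
```

The same calls should work unchanged on `netflix` once it is loaded.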

In [10]:
# TODO: Do your EDA here!
netflix.describe()
#netflix.sort_values(['date_added'],ascending=False)
#netflix.columns

Unnamed: 0,release_year
count,8807.0
mean,2014.180198
std,8.819312
min,1925.0
25%,2013.0
50%,2017.0
75%,2019.0
max,2021.0


# Granularity, Scope, and Temporality

Once you have completed your EDA, you should be able to answer each of these broad ideas about the data. If not, you can always do more EDA!

*Note, you can jot down some quick notes instead of a long answer to help remind yourself about certain characteristics of this dataframe

#### Granularity: What information does each row give us? How is each row unique (i.e., what identifies each row? This is called the **"primary key"**)

Each row represents one movie or TV show on Netflix and is uniquely identified by its show_id (the primary key). The remaining columns give its title, director, cast, and other details.

#### Scope: What are some interesting things we could learn from the data (we can find this out by exploring the columns we have)? Do we have to manipulate the data in some way to get what we're interested in?

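The dataset is mostly qualitative; release_year is the only numeric column. We would need to understand the categories in columns such as type, rating, and listed_in. For example, we could look at how the mix of movies and TV shows changes with release year. Columns such as duration ("90 min", "2 Seasons") would need to be parsed before they can be analyzed numerically.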

#### Temporality: When was the data collected? How often is the data collected (if there is a pattern)? Do we need to adjust for consistency in the dates/times? Do we need to adjust the data type of our time variables?

The data was last updated around 3 years ago. The range of the movies and shows added in the dataset is between 2018 and 2021. We should adjust for consistency as there are shows/movies from previous years that have not been added. 
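
On the data-type question: in the preview, date_added holds strings such as "September 25, 2021". A minimal sketch of converting that format to datetimes follows. The sample values are taken from the preview; the `.str.strip()` step and `errors="coerce"` are precautions I am assuming could be useful, not something the data above proves is needed.

```python
import pandas as pd

# Sample values in the format shown in the preview, plus one missing entry.
dates = pd.Series(["September 25, 2021", "July 1, 2019", None], name="date_added")

# Strip stray whitespace (a precaution), then parse "Month day, year".
# errors="coerce" turns anything unparseable into NaT instead of raising.
parsed = pd.to_datetime(dates.str.strip(), format="%B %d, %Y", errors="coerce")

print(parsed.dt.year)
```

Once the column is a datetime, `min()` and `max()` give the actual range of dates in the data.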

We will be having people share their responses to each of these questions and discuss what everybody found in the data!

# Faithfulness

This is all about deciding whether you can trust the data in the form it came in, or whether you need to make adjustments to do so.

Did you find any strange or inconsistent values? Can you figure out how the data was collected? Are there any duplicate values (in this case, there shouldn't be because each movie is a separate one)?

The country or director is missing in many rows. I'm not sure where the data was collected from, but most likely from a few movie-release websites. I did not notice any duplicate rows.
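
Rather than checking for repeats by eye, duplicates can be counted directly. This is a minimal sketch on a made-up frame with one deliberate duplicate; the `toy` data is hypothetical.

```python
import pandas as pd

# Hypothetical frame with one repeated row.
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s2"],
    "title": ["Zodiac", "Zoom", "Zoom"],
})

full_row_dupes = int(toy.duplicated().sum())   # rows identical to an earlier row
key_is_unique = toy["show_id"].is_unique       # True only if no show_id repeats

print(full_row_dupes, key_is_unique)
```

On the real data, `netflix["show_id"].is_unique` would confirm whether show_id works as a primary key.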

Possibly one of the most important questions is what to do with default values. These can be NaN, NA, 0, or something else entirely. Based on what we see from this data, what should we do with our missing values, and should we take different steps for different columns? You can check the slides for some examples of what people do with missing data.


Since country is a categorical variable, we can't reasonably impute it: there is no sound basis for guessing where a title originated. For analyses where the country matters, it is best to remove the rows where it is missing.
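
One way to take different steps for different columns is sketched below on a small made-up frame. The "Unknown" placeholder and the variable names are assumptions for illustration, not a recommendation from the slides.

```python
import pandas as pd

# Hypothetical four-row frame with gaps in two columns.
toy = pd.DataFrame({
    "title": ["Zodiac", "Zombie Dumb", "Ganglands", "Kota Factory"],
    "director": ["David Fincher", None, "Julien Leclercq", None],
    "country": ["United States", None, None, "India"],
})

# Share of missing values per column, to judge how costly dropping would be.
missing_share = toy.isnull().mean()

# For a question about countries, drop only the rows that lack a country.
by_country = toy.dropna(subset=["country"])

# Where a placeholder is acceptable, fill the gap instead of dropping the row.
labelled = toy.assign(director=toy["director"].fillna("Unknown"))

print(missing_share)
print(len(by_country), len(toy.dropna()))
```

Dropping on a subset keeps two of the four rows here, while a blanket `dropna()` keeps only one.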

# Variable Typing

Just for practice, it's important to be able to identify what type of variable something is just by looking at it. As a refresher, the 4 types are Quantitative Continuous, Quantitative Discrete, Qualitative Ordinal, and Qualitative Nominal.

#### Question: What are the variable types of
a) Director

b) Country

c) Date Added

d) Release Year


a) qualitative nominal
b) qualitative nominal
c) quantitative discrete
d) quantitative discrete

# Main Exercise

A common, but often misguided, way to handle datasets that have outliers/untrustworthy data is to drop all rows or columns that have missing values.


Find the shape of our dataset before we do any data manipulation and compare it with the shape after you use the dropna function. What do you notice and how many columns or rows were removed from our data?

Tip: You should make a deep copy of a dataframe by using {data name}.copy(deep=True). Give it an appropriate name to reflect the fact that this will be the dataframe that will have all null values dropped without any previous operations performed on it.
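
To see why the copy matters, here is a minimal sketch contrasting a deep copy with plain assignment. The frame and variable names are made up for illustration.

```python
import pandas as pd

original = pd.DataFrame({"director": ["Kirsten Johnson", None]})

# A deep copy is independent: editing it leaves the original untouched.
independent = original.copy(deep=True)
independent["director"] = independent["director"].fillna("")
missing_after_copy_edit = int(original["director"].isnull().sum())

# Plain assignment only creates a second name for the same object,
# so editing through the alias also changes the original.
alias = original
alias["director"] = alias["director"].fillna("")
missing_after_alias_edit = int(original["director"].isnull().sum())

print(missing_after_copy_edit, missing_after_alias_edit)
```

The first count stays at 1 and the second drops to 0, because the alias and the original are one dataframe.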

In [27]:
# TODO: Find the shape of our original data vs the shape of our data after we dropna
dropped_netflix = netflix.copy(deep=True)
dropped_netflix = dropped_netflix.dropna()
print(dropped_netflix.shape)
print(netflix.shape)

(5332, 12)
(8807, 12)


You should've found that some of our data was removed after using the dropna function. Let's go back to our original dataset before we used the function.

Try creating a list of the columns that have null values. Feel free to search things up on StackOverflow, the pandas documentation, etc.

In [28]:
# TODO: Create a list of columns that have null values
netflix.columns[netflix.isnull().any()].tolist()

['director', 'cast', 'country', 'date_added', 'rating', 'duration']

Now that we have the columns where missing values are present, we have a better idea of what the data types of our missing values actually are.

Try filling in the missing director values with an empty string "". To check your work, rerun the missing-columns code above on the copy you make for this task instead of the original dataframe.

In [70]:
# TODO: Fill in missing director values with an empty string
netflix["director"] = netflix["director"].fillna('')
netflix

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
...,...,...,...,...,...,...,...,...,...,...,...,...
8802,s8803,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,"November 20, 2019",2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a..."
8803,s8804,TV Show,Zombie Dumb,,,,"July 1, 2019",2018,TV-Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g..."
8804,s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,"November 1, 2019",2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...
8805,s8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,"January 11, 2020",2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero..."


Let's check if filling the "director" column will help reduce the number of rows that will be removed when we use the dropna function. 

Use the dropna function on the dataframe that is created after filling in the "director" column and find its shape. Compare the shape of this dataframe to the shape of the original dataframe and the dataframe after dropping all null values.

In [72]:
# TODO: Compare the size of the original data, the data with just dropna, and the data when dropping NA after replacing missing directors with an empty string
print(netflix.shape)
print(dropped_netflix.shape)
print(netflix.dropna().shape)

(8807, 12)
(5332, 12)
(7290, 12)


You should notice that the dataframe created when we drop null values after filling in the "director" column has fewer rows than the original dataframe, but more than the dataframe created when we drop all null values without performing any other operations.

This should serve as a basic exercise on how you can approach data that is not 100% clean. Most data in consulting projects and in the real world will force you to find ways to balance preserving the original data against getting rid of the unnecessary parts of the data.

### String Manipulation. We'll load in a new dataset for this part.

In [73]:
# TODO: load in the kazanova/sentiment140 data and unzip the file that is downloaded.
!kaggle datasets download kazanova/sentiment140 

Dataset URL: https://www.kaggle.com/datasets/kazanova/sentiment140
License(s): other
sentiment140.zip: Skipping, found more recently modified local copy (use --force to force download)


In [76]:
dir_files = os.listdir(os.getcwd())
unzipped_csvs = ['training.1600000.processed.noemoticon.csv']
# The download step above saves the archive as sentiment140.zip
if set(dir_files).isdisjoint(set(unzipped_csvs)) and "sentiment140.zip" in dir_files:
    subprocess.run(["unzip", "-o", "sentiment140.zip"], check=True)


In [78]:
DATA_PATH = os.path.join(os.getcwd(), "sentiment140.csv")
tweets_data = pd.read_csv(DATA_PATH, encoding='latin-1')
tweets_data


Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew
...,...,...,...,...,...,...
1599994,4,2193601966,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,AmandaMarie1028,Just woke up. Having no school is the best fee...
1599995,4,2193601969,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,TheWDBoards,TheWDB.com - Very cool to hear old Walt interv...
1599996,4,2193601991,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,bpbabe,Are you ready for your MoJo Makeover? Ask me f...
1599997,4,2193602064,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! ...


In [79]:
# Some cleanup of the data, this cell should run correctly if you named and read the dataframe properly
tweets = tweets_data.drop(columns = ["NO_QUERY"]).rename(columns={"0": "Polarity",
                                                                  "1467810369": "id",
                                                                  "Mon Apr 06 22:19:45 PDT 2009" : "Date",
                                                                  "_TheSpecialOne_" : "Username"})
final_tweets = tweets.set_axis([*tweets.columns[:-1], 'Tweet'], axis=1)
final_tweets.head(5)

Unnamed: 0,Polarity,id,Date,Username,Tweet
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,joy_wolf,@Kwesidei not the whole crew
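
The first tweet ended up as the column header in the raw output above, which suggests the file has no header row. Under that assumption, an alternative is to name the columns at read time with `header=None` and `names=`. This sketch uses an in-memory two-line sample instead of the real file; the tweet text and the "Query" column name are made up.

```python
import io
import pandas as pd

# Two sample lines standing in for the real CSV (tweet text is invented).
raw = io.StringIO(
    '"0","1467810369","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","_TheSpecialOne_","first tweet"\n'
    '"0","1467810672","Mon Apr 06 22:19:49 PDT 2009","NO_QUERY","scotthamilton","second tweet"\n'
)

columns = ["Polarity", "id", "Date", "Query", "Username", "Tweet"]

# header=None keeps the first line as data; names= supplies the column labels.
sketch = pd.read_csv(raw, header=None, names=columns).drop(columns=["Query"])

print(sketch)
```

Read this way, the first tweet stays in the data instead of being consumed as the header.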


#### Now that we have our **final_tweets** dataset, we can start working with it. Let's practice string methods first.

In [84]:
# TODO: Make all of the strings in the "Username" column lowercase
final_tweets["Username"] = final_tweets["Username"].str.lower()
final_tweets


Unnamed: 0,Polarity,id,Date,Username,Tweet
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,ellectf,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,joy_wolf,@Kwesidei not the whole crew
...,...,...,...,...,...
1599994,4,2193601966,Tue Jun 16 08:40:49 PDT 2009,amandamarie1028,Just woke up. Having no school is the best fee...
1599995,4,2193601969,Tue Jun 16 08:40:49 PDT 2009,thewdboards,TheWDB.com - Very cool to hear old Walt interv...
1599996,4,2193601991,Tue Jun 16 08:40:49 PDT 2009,bpbabe,Are you ready for your MoJo Makeover? Ask me f...
1599997,4,2193602064,Tue Jun 16 08:40:49 PDT 2009,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! ...


#### Try making all the entries in the Tweet column uppercase and save this in a final_tweets_upper dataframe.

In [85]:
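# TODO: make a copy of final_tweets and make the "Tweet" column all uppercase
# Note: plain assignment does not copy, so upper_final_tweets is the same object as final_tweets,
# and the line below uppercases "Username" (not "Tweet"), as the output shows
upper_final_tweets = final_tweets
upper_final_tweets["Username"] = upper_final_tweets["Username"].str.upper()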
upper_final_tweets

Unnamed: 0,Polarity,id,Date,Username,Tweet
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,SCOTTHAMILTON,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,MATTYCUS,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,ELLECTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,KAROLI,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,JOY_WOLF,@Kwesidei not the whole crew
...,...,...,...,...,...
1599994,4,2193601966,Tue Jun 16 08:40:49 PDT 2009,AMANDAMARIE1028,Just woke up. Having no school is the best fee...
1599995,4,2193601969,Tue Jun 16 08:40:49 PDT 2009,THEWDBOARDS,TheWDB.com - Very cool to hear old Walt interv...
1599996,4,2193601991,Tue Jun 16 08:40:49 PDT 2009,BPBABE,Are you ready for your MoJo Makeover? Ask me f...
1599997,4,2193602064,Tue Jun 16 08:40:49 PDT 2009,TINYDIAMONDZ,Happy 38th Birthday to my boo of alll time!!! ...
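
For comparison, here is a minimal sketch of the task as worded in the heading: uppercase the Tweet column on an independent copy. The `toy_tweets` frame is a made-up two-row stand-in for final_tweets.

```python
import pandas as pd

# Hypothetical two-row stand-in for final_tweets.
toy_tweets = pd.DataFrame({
    "Username": ["joy_wolf", "karoli"],
    "Tweet": ["@Kwesidei not the whole crew", "my whole body feels itchy"],
})

# Copy first so the source dataframe is left unchanged.
toy_tweets_upper = toy_tweets.copy(deep=True)
toy_tweets_upper["Tweet"] = toy_tweets_upper["Tweet"].str.upper()

print(toy_tweets_upper)
```

Because of the copy, `toy_tweets` keeps its original casing in both columns.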


#### Now let's try to replace something in a string!

In [87]:
# TODO: create a dataframe that filters out the final_tweets dataframe's tweets and only shows tweets with underscores
final_tweets_underscore = final_tweets[final_tweets['Tweet'].str.contains('_')]
final_tweets_underscore

Unnamed: 0,Polarity,id,Date,Username,Tweet
7,0,1467811795,Mon Apr 06 22:20:05 PDT 2009,2HOOD4HOLLYWOOD,@Tatiana_K nope they didn't have it
21,0,1467814119,Mon Apr 06 22:20:40 PDT 2009,COOLIODOC,@angry_barista I baked you a cake but I ated it
68,0,1467825084,Mon Apr 06 22:23:30 PDT 2009,PRESIDENTSNOW,"@Lt_Algonquin agreed, I saw the failwhale alll..."
108,0,1467838188,Mon Apr 06 22:26:54 PDT 2009,JESS_HIGLEY,@marykatherine_q i know! I heard it this after...
118,0,1467839586,Mon Apr 06 22:27:18 PDT 2009,SONYOLMOS,@eRRe_sC aaw i miss ya all too.. im leaving to...
...,...,...,...,...,...
1599986,4,2193579092,Tue Jun 16 08:38:58 PDT 2009,CATHRIIIN,@La_r_a NEVEER I think that you both will get...
1599987,4,2193579191,Tue Jun 16 08:38:59 PDT 2009,TELLMAN,@Roy_Everitt ha- good job. that's right - we g...
1599988,4,2193579211,Tue Jun 16 08:38:59 PDT 2009,JAZZSTIXX,@Ms_Hip_Hop im glad ur doing well
1599992,4,2193579477,Tue Jun 16 08:39:00 PDT 2009,CHLOEAMISHA,@SCOOBY_GRITBOYS


Task 2: In the cell below, use .str.replace to replace all the underscores in Username with a space (" ") in the tweeters_with__ dataframe. Output the changed tweeters_with__ dataframe.

Doc: https://pandas.pydata.org/docs/reference/api/pandas.Series.str.replace.html

In [88]:
# TODO: replace all "_" with " " in the "Username" column
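# Copy first so the replacement does not also modify final_tweets
tweeters_with__ = final_tweets.copy(deep=True)
# regex=False treats "_" as a literal character rather than a pattern
tweeters_with__["Username"] = tweeters_with__["Username"].str.replace("_", " ", regex=False)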
tweeters_with__

Unnamed: 0,Polarity,id,Date,Username,Tweet
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,SCOTTHAMILTON,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,MATTYCUS,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,ELLECTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,KAROLI,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,JOY WOLF,@Kwesidei not the whole crew
...,...,...,...,...,...
1599994,4,2193601966,Tue Jun 16 08:40:49 PDT 2009,AMANDAMARIE1028,Just woke up. Having no school is the best fee...
1599995,4,2193601969,Tue Jun 16 08:40:49 PDT 2009,THEWDBOARDS,TheWDB.com - Very cool to hear old Walt interv...
1599996,4,2193601991,Tue Jun 16 08:40:49 PDT 2009,BPBABE,Are you ready for your MoJo Makeover? Ask me f...
1599997,4,2193602064,Tue Jun 16 08:40:49 PDT 2009,TINYDIAMONDZ,Happy 38th Birthday to my boo of alll time!!! ...
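
If the goal is a dataframe containing only the tweeters whose usernames have underscores, one possible reading of the task is sketched below: filter first, then replace. The toy frame and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical three-row stand-in for final_tweets.
toy_tweets = pd.DataFrame({
    "Username": ["joy_wolf", "karoli", "jess_higley"],
    "Tweet": ["a", "b", "c"],
})

# Keep only rows whose username contains a literal underscore.
has_underscore = toy_tweets["Username"].str.contains("_", regex=False)
toy_with_underscore = toy_tweets[has_underscore].copy()

# Replace the underscores on the filtered copy only.
toy_with_underscore["Username"] = toy_with_underscore["Username"].str.replace(
    "_", " ", regex=False
)

print(toy_with_underscore)
```

Calling `.copy()` on the filtered frame avoids modifying a slice of the original and keeps `toy_tweets` intact.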
