<a href="https://colab.research.google.com/github/monozi/CCDATSCL_EXERCISES_COM222/blob/main/Exercise%20%232.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise 2

<img src="https://vsqfvsosprmjdktwilrj.supabase.co/storage/v1/object/public/images/insights/1753644539114-netflix.jpeg"/>


In this activity, you will explore two fundamental preprocessing techniques used in data science and machine learning: feature scaling and discretization (binning).

These techniques are essential when working with datasets that contain numerical values on very different scales, or continuous variables that may be more useful when grouped into categories.


We will use a subset of the Netflix Movies and TV Shows dataset, which contains metadata such as release year, duration, ratings, and other attributes of titles currently or previously available on Netflix. Although the dataset was not originally designed for numerical modeling, it contains several features suitable for preprocessing practice, such as:
- Release Year
- Duration (in minutes)
- Number of Cast Members
- Number of Listed Genres
- Title Word Count

In this worksheet, you will:
- Load and inspect the dataset
- Select numerical features for scaling
- Apply different scaling techniques
  - Min–Max Scaling
  - Standardization
  - Robust Scaling
- Perform discretization (binning)
  - Equal-width binning
  - Equal-frequency binning
- Evaluate how scaling affects machine learning performance, using a simple KNN classifier

In [1]:
import pandas as pd
import os
# Install dependencies as needed:
# pip install kagglehub[pandas-datasets]
import kagglehub


## 1. Setup and Data Loading



Load the Netflix dataset into a DataFrame named df.

In [2]:

# Download latest version
path = kagglehub.dataset_download("shivamb/netflix-shows")

print("Path to dataset files:", path)


if os.path.isdir(path):
  print(True)

contents = os.listdir(path)
contents

mydataset = os.path.join(path, contents[0])  # build the file path portably
mydataset


df = pd.read_csv(mydataset)

Using Colab cache for faster access to the 'netflix-shows' dataset.
Path to dataset files: /kaggle/input/netflix-shows
True
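
`contents[0]` relies on the order returned by `os.listdir`, which is not guaranteed. As a minimal sketch, a small helper (hypothetically named `first_csv` here; it is not part of kagglehub) could pick the CSV file by extension instead. The temporary directory and file names below are made up for the demonstration.

```python
import tempfile
from pathlib import Path


def first_csv(directory):
    """Return the alphabetically first .csv file in `directory` (hypothetical helper)."""
    csv_files = sorted(Path(directory).glob("*.csv"))
    if not csv_files:
        raise FileNotFoundError(f"No CSV file found in {directory}")
    return csv_files[0]


# Demo on a throwaway directory with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a_readme.txt").write_text("not data")
    (Path(tmp) / "example.csv").write_text("show_id,type\ns1,Movie\n")
    demo_name = first_csv(tmp).name

print(demo_name)
```

In the notebook, the same idea would be `df = pd.read_csv(first_csv(path))`, assuming the download folder contains a single CSV file.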


## 2. Data Understanding

Store the dataset’s column names in a variable called cols.

In [4]:
cols = df.columns
display(cols)

Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')

In [10]:
df.dtypes

Unnamed: 0,0
show_id,object
type,object
title,object
director,object
cast,object
country,object
date_added,object
release_year,int64
rating,object
duration,object


Store the shape of the dataset as a tuple (rows, columns) in shape_info.

In [5]:
# put your answer here
shape_info = df.shape
shape_info

(8807, 12)

## 3. Data Cleaning
Count missing values per column and save to missing_counts.

In [6]:
# put your answer here
missing_counts = df.isna().sum()
missing_counts

Unnamed: 0,0
show_id,0
type,0
title,0
director,2634
cast,825
country,831
date_added,10
release_year,0
rating,4
duration,3


Drop rows where duration is missing. Save to df_clean.

In [7]:
# put your answer here
df_clean = df.dropna(subset=['duration'])
df_clean

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
...,...,...,...,...,...,...,...,...,...,...,...,...
8802,s8803,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,"November 20, 2019",2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a..."
8803,s8804,TV Show,Zombie Dumb,,,,"July 1, 2019",2018,TV-Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g..."
8804,s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,"November 1, 2019",2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...
8805,s8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,"January 11, 2020",2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero..."


In [39]:
df.head(30)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
5,s6,TV Show,Midnight Mass,Mike Flanagan,"Kate Siegel, Zach Gilford, Hamish Linklater, H...",,"September 24, 2021",2021,TV-MA,1 Season,"TV Dramas, TV Horror, TV Mysteries",The arrival of a charismatic young priest brin...
6,s7,Movie,My Little Pony: A New Generation,"Robert Cullen, José Luis Ucha","Vanessa Hudgens, Kimiko Glenn, James Marsden, ...",,"September 24, 2021",2021,PG,91 min,Children & Family Movies,Equestria's divided. But a bright-eyed hero be...
7,s8,Movie,Sankofa,Haile Gerima,"Kofi Ghanaba, Oyafunmike Ogunlano, Alexandra D...","United States, Ghana, Burkina Faso, United Kin...","September 24, 2021",1993,TV-MA,125 min,"Dramas, Independent Movies, International Movies","On a photo shoot in Ghana, an American model s..."
8,s9,TV Show,The Great British Baking Show,Andy Devonshire,"Mel Giedroyc, Sue Perkins, Mary Berry, Paul Ho...",United Kingdom,"September 24, 2021",2021,TV-14,9 Seasons,"British TV Shows, Reality TV",A talented batch of amateur bakers face off in...
9,s10,Movie,The Starling,Theodore Melfi,"Melissa McCarthy, Chris O'Dowd, Kevin Kline, T...",United States,"September 24, 2021",2021,PG-13,104 min,"Comedies, Dramas",A woman adjusting to life after a loss contend...


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


## 4. Selecting Relevant Numeric Features

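Candidate fields for numeric preprocessing in this dataset include:
- release_year
- duration
- rating

Of these, only release_year is stored with a numeric dtype. duration and rating are stored as text (object), so a dtype-based selection will skip them until they are converted.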


Create a DataFrame `df_num` containing only numeric columns.

In [17]:
# put your answer here
df_num = df.select_dtypes(include=['number'])
df_num

Unnamed: 0,release_year
0,2020
1,2021
2,2021
3,2021
4,2021
...,...
8802,2007
8803,2018
8804,2009
8805,2006


## 5. Feature Scaling

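Focus on a single numeric column (e.g., duration).

Because duration is stored as text ("90 min", "2 Seasons"), it must be converted to a number first. The cell below uses the stated minutes for movies and estimates a total runtime for TV shows as seasons × episodes per season × episode length, using assumed values per genre.

Extract the resulting duration column into a Series named `dur`.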

In [38]:
df_processed = df_clean.copy()
df_processed['duration_minutes'] = 0

# Process Movies
movie_mask = df_processed['type'] == 'Movie'
df_processed.loc[movie_mask, 'duration_minutes'] = (
    df_processed.loc[movie_mask, 'duration'].str.split(' ').str[0].astype(int)
)

# Process TV Shows
tv_show_mask = df_processed['type'] == 'TV Show'

# Extract number of seasons
df_processed.loc[tv_show_mask, 'num_seasons'] = (
    df_processed.loc[tv_show_mask, 'duration'].str.split(' ').str[0].astype(int)
)

# Initialize default values for TV shows
df_processed.loc[tv_show_mask, 'episodes_per_season'] = 15
df_processed.loc[tv_show_mask, 'episode_length_in_minutes'] = 30

# Apply Miniseries specific values
miniseries_mask = tv_show_mask & df_processed['listed_in'].str.contains('Miniseries', na=False)
df_processed.loc[miniseries_mask, 'episodes_per_season'] = 5
df_processed.loc[miniseries_mask, 'episode_length_in_minutes'] = 60

# Apply specific TV show sub-genres values
tv_genres = [
    'Kids\' TV', 'Anime Series', 'Korean TV Shows', 'British TV Shows',
    'Spanish-Language TV Shows', 'Teen TV Shows', 'Science & Nature TV',
    'TV Action & Adventure', 'TV Sci-Fi & Fantasy', 'TV Horror', 'TV Thrillers',
    'Stand-Up Comedy & Talk Shows'
]

for genre in tv_genres:
    genre_mask = tv_show_mask & df_processed['listed_in'].str.contains(genre, na=False)
    df_processed.loc[genre_mask, 'episodes_per_season'] = 10
    df_processed.loc[genre_mask, 'episode_length_in_minutes'] = 45

# Calculate total duration_minutes for TV shows
df_processed.loc[tv_show_mask, 'duration_minutes'] = (
    df_processed.loc[tv_show_mask, 'num_seasons'] *
    df_processed.loc[tv_show_mask, 'episodes_per_season'] *
    df_processed.loc[tv_show_mask, 'episode_length_in_minutes']
)

# Store the final duration_minutes column into a series named 'dur'
dur = df_processed['duration_minutes']

# Drop temporary columns used for TV show calculations
df_processed.drop(columns=['num_seasons', 'episodes_per_season', 'episode_length_in_minutes'], inplace=True)

display(dur.head(10))

Unnamed: 0,duration_minutes
0,90
1,900
2,450
3,450
4,900
5,450
6,91
7,125
8,4050
9,104
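
The string splitting above can also be written with a regular expression that captures the number and its unit in one pass. This is only a sketch on a few made-up strings in the same format as the `duration` column; the column names `value` and `unit` are chosen here for illustration.

```python
import pandas as pd

# Toy values in the same format as the `duration` column
raw = pd.Series(["90 min", "2 Seasons", "1 Season", "125 min"])

# Capture the leading integer and the unit word that follows it
parts = raw.str.extract(r"^(?P<value>\d+)\s+(?P<unit>\w+)$")
parts["value"] = parts["value"].astype(int)

# Normalize "Season"/"Seasons" to a single label
parts["unit"] = parts["unit"].str.lower().str.rstrip("s")

print(parts)
```

Keeping the unit alongside the number makes it explicit that movie minutes and season counts are different quantities before they are combined.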


Apply Min–Max Scaling to `dur`. Store the result as `dur_minmax`.

In [43]:
from sklearn.preprocessing import MinMaxScaler

# Reshape dur to a 2D array as MinMaxScaler expects 2D input
dur_reshaped = dur.values.reshape(-1, 1)

# Initialize MinMaxScaler
scaler = MinMaxScaler()

# Apply Min-Max Scaling
dur_minmax = pd.Series(scaler.fit_transform(dur_reshaped).flatten(), index=dur.index)
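
Min–Max scaling maps each value to (x - min) / (max - min), so the smallest value becomes 0 and the largest becomes 1. The sketch below checks that formula against `MinMaxScaler` on a small toy Series (the values are illustrative, not the full `dur` column).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Small illustrative sample of durations in minutes
toy = pd.Series([90, 450, 900, 4050], dtype=float)

# Min-Max scaling by hand
manual = (toy - toy.min()) / (toy.max() - toy.min())

# Same transformation through scikit-learn (expects a 2D array)
scaled = MinMaxScaler().fit_transform(toy.to_numpy().reshape(-1, 1)).ravel()

print(np.allclose(manual, scaled))
```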

Apply Z-score Standardization to `dur`. Store in `dur_zscore`.

In [44]:
# put your answer here
from sklearn.preprocessing import StandardScaler

# Initialize StandardScaler
scaler = StandardScaler()

# Apply Z-score Standardization
dur_zscore = scaler.fit_transform(dur_reshaped)

# Convert back to a Series for easier handling, if desired
dur_zscore = pd.Series(dur_zscore.flatten(), index=dur.index)
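
The worksheet goals also list Robust Scaling, which the cells above do not apply. As a hedged sketch on a toy Series (values taken from the first rows of `dur` shown earlier), the block below computes the z-score and the robust scaling by hand and compares both with scikit-learn. Note that `StandardScaler` uses the population standard deviation (`ddof=0`), while pandas' `std()` defaults to `ddof=1`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler, StandardScaler

# Toy sample with one very large value (4050)
toy = pd.Series([90, 91, 104, 125, 450, 4050], dtype=float)
X_toy = toy.to_numpy().reshape(-1, 1)

# Z-score by hand: subtract the mean, divide by the population std
manual_z = (toy - toy.mean()) / toy.std(ddof=0)
sk_z = StandardScaler().fit_transform(X_toy).ravel()

# Robust scaling by hand: subtract the median, divide by the IQR
iqr = toy.quantile(0.75) - toy.quantile(0.25)
manual_robust = (toy - toy.median()) / iqr
sk_robust = RobustScaler().fit_transform(X_toy).ravel()

print(np.allclose(manual_z, sk_z), np.allclose(manual_robust, sk_robust))
```

Because the median and IQR are barely affected by a few extreme values, robust scaling is a reasonable option for a skewed column such as `dur`.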

## 6. Discretization (Binning)
Apply equal-width binning to dur into 5 bins. Store as `dur_width_bins`.


- Use `pandas.cut()` to divide duration_minutes into 5 equal-width bins.
- Add the resulting bins as a new column named:
`duration_equal_width_bin`

In [59]:
# put your answer here
dur_width_bins = pd.cut(dur, bins=5)
df_processed['duration_equal_width_bin'] = dur_width_bins
display(df_processed[['duration_minutes', 'duration_equal_width_bin']].head())

Unnamed: 0,duration_minutes,duration_equal_width_bin
0,90,"(-4.647, 1532.4]"
1,900,"(-4.647, 1532.4]"
2,450,"(-4.647, 1532.4]"
3,450,"(-4.647, 1532.4]"
4,900,"(-4.647, 1532.4]"


Describe the characteristics of each bin

- What are the bin edges produced by equal-width binning?
- How many titles (movies and TV shows) fall into each bin?

In [60]:
# put your answer here
bin_edges = pd.cut(dur, bins=5, retbins=True)[1]
print("Equal-width bin edges:", bin_edges)

bin_counts = df_processed['duration_equal_width_bin'].value_counts().sort_index()
print("\nNumber of movies/TV shows in each equal-width bin:")
display(bin_counts)

Equal-width bin edges: [-4.6470e+00  1.5324e+03  3.0618e+03  4.5912e+03  6.1206e+03  7.6500e+03]

Number of movies/TV shows in each equal-width bin:


Unnamed: 0_level_0,count
duration_equal_width_bin,Unnamed: 1_level_1
"(-4.647, 1532.4]",8545
"(1532.4, 3061.8]",193
"(3061.8, 4591.2]",56
"(4591.2, 6120.6]",7
"(6120.6, 7650.0]",3
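
The negative lowest edge (-4.647) is not a data value. When `pd.cut` is given an integer number of bins, it spaces the edges evenly between the minimum and maximum and then lowers the first edge by 0.1% of the range so that the minimum falls inside the first bin. The sketch below reproduces this on a toy Series that shares the minimum (3) and maximum (7650) seen in the edges above; the middle values are arbitrary.

```python
import numpy as np
import pandas as pd

# Toy values with the same min (3) and max (7650) as `dur`
values = pd.Series([3, 90, 450, 900, 7650])

# Edges as returned by pandas
_, edges = pd.cut(values, bins=5, retbins=True)

# Edges rebuilt by hand: evenly spaced, then the first edge is lowered
lo, hi = float(values.min()), float(values.max())
manual_edges = np.linspace(lo, hi, 6)
manual_edges[0] -= 0.001 * (hi - lo)

print(edges)
print(np.allclose(edges, manual_edges))
```

Each bin is (7650 - 3) / 5 = 1529.4 minutes wide, which is why almost every title lands in the first bin.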


Apply equal-frequency binning to dur into 5 bins. Store as `dur_quantile_bins`.

- Use `pandas.qcut()` to divide duration_minutes into 5 equal-frequency bins.
- Add the result as a new column named:
`duration_equal_freq_bin`

In [63]:
# put your answer here
dur_quantile_bins = pd.qcut(dur, q=5, duplicates='drop')
df_processed['duration_equal_freq_bin'] = dur_quantile_bins
display(df_processed[['duration_minutes', 'duration_equal_freq_bin']].head())

Unnamed: 0,duration_minutes,duration_equal_freq_bin
0,90,"(89.0, 102.0]"
1,900,"(450.0, 7650.0]"
2,450,"(127.0, 450.0]"
3,450,"(127.0, 450.0]"
4,900,"(450.0, 7650.0]"


Describe the characteristics of each bin

- What are the bin ranges produced by equal-frequency binning?
- How many titles (movies and TV shows) fall into each bin? Are the counts nearly equal?

In [64]:
# put your answer here
bin_ranges = pd.qcut(dur, q=5, retbins=True, duplicates='drop')[1]
print("Equal-frequency bin ranges:", bin_ranges)

bin_counts_freq = df_processed['duration_equal_freq_bin'].value_counts().sort_index()
print("\nNumber of movies/TV shows in each equal-frequency bin:")
display(bin_counts_freq)
print("\nAre the counts nearly equal? (True if standard deviation is small):")
print(bin_counts_freq.std() < (bin_counts_freq.mean() * 0.1)) # Check if std is less than 10% of mean

Equal-frequency bin ranges: [3.00e+00 8.90e+01 1.02e+02 1.27e+02 4.50e+02 7.65e+03]

Number of movies/TV shows in each equal-frequency bin:


Unnamed: 0_level_0,count
duration_equal_freq_bin,Unnamed: 1_level_1
"(2.999, 89.0]",1838
"(89.0, 102.0]",1714
"(102.0, 127.0]",1757
"(127.0, 450.0]",2612
"(450.0, 7650.0]",883



Are the counts nearly equal? (True if standard deviation is small):
False
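
The counts are unequal because many titles share the same value (for example, 450), and `qcut` never splits identical values across bins. A minimal sketch on a made-up Series with heavy ties shows the effect, along with one common workaround: binning the ranks instead of the raw values.

```python
import pandas as pd

# Made-up data where half of the values are tied at 50
s = pd.Series([10, 20, 30, 40, 50, 50, 50, 50, 50, 60])

# Quantile edges collide at 50, so one edge is dropped and bins are uneven
tied_bins = pd.qcut(s, q=5, duplicates="drop")
tied_counts = tied_bins.value_counts().sort_index()

# Ranking with method="first" breaks ties by position, giving equal counts
rank_bins = pd.qcut(s.rank(method="first"), q=5, labels=False)
rank_counts = rank_bins.value_counts().sort_index()

print(tied_counts.tolist())
print(rank_counts.tolist())
```

The rank-based version trades interpretability for balance: identical durations can end up in different bins, so it is only appropriate when equal bin sizes matter more than consistent bin boundaries.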


## 7. KNN Before & After Scaling


Create a feature matrix X using any two numeric columns and a target y (e.g., classification by genre or type). Create a train/test split.

Train a KNN classifier without scaling. Store accuracy in acc_raw.

In [57]:
# put your answer here
from sklearn.model_selection import train_test_split

# Define features (X) and target (y)
X = df_processed[['release_year', 'duration_minutes']]
y = df_processed['type']

# Create a train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (6162, 2)
Shape of X_test: (2642, 2)
Shape of y_train: (6162,)
Shape of y_test: (2642,)
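
When the classes are imbalanced, passing `stratify=y` to `train_test_split` keeps the class proportions the same in both splits. The sketch below uses a made-up 70/30 label mix, not the actual class balance of this dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 70 "Movie" and 30 "TV Show"
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array(["Movie"] * 70 + ["TV Show"] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# Share of "Movie" labels in each split
train_share = (y_tr == "Movie").mean()
test_share = (y_te == "Movie").mean()
print(train_share, test_share)
```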


Train a KNN classifier on the unscaled features and store its accuracy in acc_raw. Then scale `X` using either Min–Max or Standardization, retrain KNN, and store the accuracy in acc_scaled.

In [58]:
# put your answer here
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Initialize KNN classifier without scaling
knn_raw = KNeighborsClassifier(n_neighbors=5) # Using default n_neighbors=5

# Train the classifier
knn_raw.fit(X_train, y_train)

# Make predictions on the test set
y_pred_raw = knn_raw.predict(X_test)

# Calculate accuracy
acc_raw = accuracy_score(y_test, y_pred_raw)

print(f"Accuracy of KNN without scaling: {acc_raw:.4f}")

Accuracy of KNN without scaling: 1.0000


In [56]:
from sklearn.preprocessing import MinMaxScaler

# Initialize MinMaxScaler
scaler = MinMaxScaler()

# Fit on training data and transform both training and testing data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize KNN classifier
knn_scaled = KNeighborsClassifier(n_neighbors=5)

# Train the classifier on scaled data
knn_scaled.fit(X_train_scaled, y_train)

# Make predictions on the scaled test set
y_pred_scaled = knn_scaled.predict(X_test_scaled)

# Calculate accuracy
acc_scaled = accuracy_score(y_test, y_pred_scaled)

print(f"Accuracy of KNN with Min-Max scaling: {acc_scaled:.4f}")

Accuracy of KNN with Min-Max scaling: 0.9985
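
Fitting the scaler on the training split only, as done above, avoids leaking test-set statistics into the model. One way to make that automatic is to chain the scaler and the classifier in a scikit-learn pipeline. The sketch below uses synthetic data from `make_classification` rather than the Netflix features, and `StandardScaler` is an arbitrary choice here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature classification problem (illustration only)
X_syn, y_syn = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

# The scaler is re-fitted on the training part of every fold
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(pipe, X_syn, y_syn, cv=5)

print(scores.mean())
```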


Did scaling improve accuracy? Explain why.

In [None]:
# put your answer here
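# No. Scaling slightly decreased accuracy: KNN reached 1.0000 on the raw features and 0.9985 after Min-Max scaling.
# duration_minutes was built from `type` (stated minutes for movies, an estimate of seasons x episodes x episode length for TV shows), so on its own it separates the two classes almost perfectly.
# On the raw features its range (3 to 7650) is far wider than that of release_year, so it dominates the Euclidean distance and KNN effectively classifies on duration alone.
# Min-Max scaling maps both features to [0, 1]. This gives release_year much more relative weight and compresses the gap between movie and TV-show durations, so a few test points end up with neighbors of the wrong class.
# The perfect raw score therefore reflects how the feature was constructed, not a general advantage of unscaled KNN. Scaling usually helps when every feature carries useful signal but the features are measured on very different scales.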
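
To see how scaling can change which neighbor is closest, the sketch below compares Euclidean distances before and after Min-Max scaling on a few hand-picked points. The duration extremes (3 and 7650) come from the binning output above; the years and the three labelled points are assumptions made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Columns: release_year, duration_minutes
points = np.array([
    [1925, 3],     # assumed extreme, only sets the feature ranges
    [2021, 7650],  # assumed extreme, only sets the feature ranges
    [2000, 100],   # A: query point (movie-like)
    [2020, 110],   # B: movie-like, far in year, close in duration
    [2001, 450],   # C: TV-show-like, close in year, far in duration
], dtype=float)

A, B, C = points[2], points[3], points[4]
raw_ab = np.linalg.norm(A - B)
raw_ac = np.linalg.norm(A - C)

scaled = MinMaxScaler().fit_transform(points)
sA, sB, sC = scaled[2], scaled[3], scaled[4]
scaled_ab = np.linalg.norm(sA - sB)
scaled_ac = np.linalg.norm(sA - sC)

print("raw:    d(A,B) =", round(raw_ab, 2), " d(A,C) =", round(raw_ac, 2))
print("scaled: d(A,B) =", round(scaled_ab, 4), " d(A,C) =", round(scaled_ac, 4))
```

On the raw values A is closest to B because the duration gap dominates; after scaling, the year gap dominates and A is closest to C.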