In [None]:
#Original dataset: https://www.kaggle.com/datasets/rezaghari/friends-series-dataset?select=friends_episodes_v3.csv

# Finding the Weirdest FRIENDS Episodes
The goal of this notebook is to identify the "weirdest" episodes of the TV show FRIENDS using natural language processing and unsupervised anomaly detection. By embedding the episode titles and summaries into high-dimensional vectors, we can use machine learning to spot episodes that stand out from the rest in terms of their content.

---

## Data Loading and Preparation
We start by loading a dataset of FRIENDS episodes. Of its columns, we only need the title and a brief summary of each episode.

In [2]:
import pandas as pd
original_df = pd.read_csv("friends_episodes_v3.csv", encoding='cp1252')
original_df['Summary'] = original_df['Summary'].apply(lambda x: x.encode('utf-8', 'replace').decode('utf-8'))
original_df.head()

Unnamed: 0,Year_of_prod,Season,Episode Number,Episode_Title,Duration,Summary,Director,Stars,Votes
0,1994,1,1,The One Where Monica Gets a Roommate: The Pilot,22,"Monica and the gang introduce Rachel to the ""r...",James Burrows,8.3,7440
1,1994,1,2,The One with the Sonogram at the End,22,Ross finds out his ex-wife is pregnant. Rachel...,James Burrows,8.1,4888
2,1994,1,3,The One with the Thumb,22,Monica becomes irritated when everyone likes h...,James Burrows,8.2,4605
3,1994,1,4,The One with George Stephanopoulos,22,Joey and Chandler take Ross to a hockey game t...,James Burrows,8.1,4468
4,1994,1,5,The One with the East German Laundry Detergent,22,"Eager to spend time with Rachel, Ross pretends...",Pamela Fryman,8.5,4438


In [3]:
# Take an explicit copy so later column assignments do not trigger SettingWithCopyWarning
df = original_df[["Episode_Title", "Summary"]].copy()
print(df.shape)
df.head()

(236, 2)


Unnamed: 0,Episode_Title,Summary
0,The One Where Monica Gets a Roommate: The Pilot,"Monica and the gang introduce Rachel to the ""r..."
1,The One with the Sonogram at the End,Ross finds out his ex-wife is pregnant. Rachel...
2,The One with the Thumb,Monica becomes irritated when everyone likes h...
3,The One with George Stephanopoulos,Joey and Chandler take Ross to a hockey game t...
4,The One with the East German Laundry Detergent,"Eager to spend time with Rachel, Ross pretends..."


## Text Embedding

We first concatenate the title and summary into a single text field for each episode. To capture its semantic meaning, we then encode that text with a pre-trained sentence embedding model, here `all-MiniLM-L6-v2` from Sentence Transformers. The model maps each episode's text to a 384-dimensional vector, so that episodes with similar content lie close together in the embedding space.

In [4]:
from sentence_transformers import SentenceTransformer
df['text'] = df['Episode_Title'] + ". " + df['Summary']
model = SentenceTransformer('all-MiniLM-L6-v2')  # or any suitable model
embeddings = model.encode(df['text'].tolist(), show_progress_bar=True)

Batches: 100%|██████████| 8/8 [00:05<00:00,  1.59it/s]


In [5]:
embeddings

array([[-0.06163617, -0.00724177, -0.08605129, ...,  0.16980025,
         0.00630181, -0.00783357],
       [-0.03714563,  0.01532045, -0.04863459, ...,  0.11940778,
         0.06601115,  0.00470465],
       [-0.03073537,  0.00085173,  0.00155782, ...,  0.00575437,
        -0.0098543 ,  0.06319448],
       ...,
       [-0.0570685 ,  0.027488  , -0.00088583, ...,  0.20723753,
         0.04050706, -0.00823143],
       [-0.00528817,  0.02712562,  0.02982957, ...,  0.09810051,
         0.08294815, -0.0153671 ],
       [ 0.02464484,  0.01945987,  0.06406246, ...,  0.12363902,
         0.03055756,  0.02324346]], shape=(236, 384), dtype=float32)
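
As a quick sanity check on the claim that similar episodes sit close together, one could look up the nearest neighbours of a given row by cosine similarity. The sketch below is only an illustration: `most_similar` is a hypothetical helper, and `toy` is a tiny stand-in for the real `(236, 384)` embedding matrix.

```python
import numpy as np

def most_similar(vectors: np.ndarray, query_idx: int, k: int = 3) -> list:
    """Return the indices of the k rows most cosine-similar to row query_idx."""
    # Normalise each row so that a dot product equals cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[query_idx]
    sims[query_idx] = -np.inf  # exclude the query row itself
    # Sort descending by similarity and keep the top k.
    return np.argsort(sims)[::-1][:k].tolist()

# Toy stand-in for the embedding matrix (assumption: 4 items, 3 dimensions).
toy = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.1, 0.9, 0.1],
])
neighbours = most_similar(toy, query_idx=0, k=2)
print(neighbours)
```

In the notebook itself, passing `embeddings` and an episode's row index would return the rows of the most similar episodes, which can then be mapped back to titles with `df.iloc`.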

## Anomaly Detection with Isolation Forest

To find the "weirdest" episodes, we use the Isolation Forest algorithm. This unsupervised model repeatedly partitions the data with random splits; points that can be isolated in only a few splits are likely outliers. It assigns an anomaly score to each episode based on how easily it is isolated from the rest. Episodes with the lowest scores are considered the most unusual or "weird," and with `contamination=0.05` roughly 5% of episodes receive a negative score.
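
To make the scoring convention concrete, here is a minimal sketch on synthetic data rather than the episode embeddings. The cluster size, dimensionality, and the planted outlier are all assumptions chosen for illustration; only `IsolationForest`, `decision_function`, and `predict` come from scikit-learn.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# 200 "ordinary" points in 8 dimensions plus one planted outlier far away.
inliers = rng.normal(size=(200, 8))
outlier = np.full((1, 8), 8.0)
X = np.vstack([inliers, outlier])

forest = IsolationForest(contamination=0.05, random_state=42).fit(X)
scores = forest.decision_function(X)  # lower = more anomalous; negative = flagged
labels = forest.predict(X)            # -1 = outlier, 1 = inlier

most_anomalous = int(np.argmin(scores))  # row index of the lowest score
n_flagged = int((labels == -1).sum())    # roughly 5% of the rows
print(most_anomalous, n_flagged)
```

The planted point is isolated after very few random splits, so it should receive the lowest score, which mirrors how the lowest-scoring episodes are read as the "weirdest."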

In [6]:
from sklearn.ensemble import IsolationForest
iso = IsolationForest(contamination=0.05, random_state=42)
iso.fit(embeddings)
scores = iso.decision_function(embeddings)
df['anomaly_score'] = scores

# Find the 10 episodes with the lowest anomaly scores
weirdest = df.nsmallest(10, 'anomaly_score')[['Episode_Title', 'Summary', 'anomaly_score']]
print(weirdest)

                          Episode_Title  \
136        The One That Could Have Been   
104  The One with All the Thanksgivings   
157  The One Where They're Up All Night   
65       The One with the Hypnosis Tape   
160       The One with Joey's New Brain   
116         The One with the Ride Along   
188       The One with Joey's Interview   
135        The One That Could Have Been   
234                The Last One: Part 1   
19   The One with the Evil Orthodontist   

                                               Summary  anomaly_score  
136  The gang continue to think about how different...      -0.072809  
104  The gang remember and share with each other th...      -0.046759  
157  After the gang head up to the roof to see a pa...      -0.016059  
65   Monica goes out with a guy who turns out to be...      -0.010175  
160  Joey gets an opportunity to rejoin "Days of Ou...      -0.007837  
116  The guys go on a ride-along with Gary. Rachel ...      -0.005165  
188  Joey prepares for



## Dimensionality Reduction and Visualization

Since we can't visualize 384-dimensional data directly, we use Principal Component Analysis (PCA) to reduce the embeddings to 2 dimensions. This allows us to plot all episodes on a 2D scatter plot for intuitive exploration.

We plot all episodes in the 2D PCA space using Plotly. The weirdest episodes (as determined by Isolation Forest) are highlighted with a different color and marker. Hovering over a point reveals the episode title.

In [11]:
from sklearn.decomposition import PCA
import plotly.express as px
pca = PCA(n_components=2, random_state=42)
pca_result = pca.fit_transform(embeddings)
df['pca1'] = pca_result[:, 0]
df['pca2'] = pca_result[:, 1]

weirdest_indices = weirdest.index
df['is_weird'] = df.index.isin(weirdest_indices)

# Plot with Plotly
fig = px.scatter(
    df,
    x='pca1',
    y='pca2',
    color='is_weird',
    hover_name='Episode_Title',
    symbol='is_weird',  # Optional: different marker for weirdest
    title='FRIENDS Episodes: PCA Visualization (Weirdest Highlighted)'
)
fig.show()




**Important Caveat:**  
PCA is a lossy compression technique. It preserves the directions of greatest variance, but inevitably discards a lot of information. As a result, the 2D plot is only an approximation of the true relationships in the full embedding space. Some episodes that appear "normal" in 2D may be outliers in 384D, and vice versa.
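
One way to gauge how much the 2D view discards is to inspect `explained_variance_ratio_` on a fitted PCA. The sketch below uses random stand-in data of the same shape as the embeddings (an assumption for illustration, so the printed figure says nothing about the real episodes); in the notebook, the already fitted `pca` object exposes the same attribute.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Random stand-in with the same shape as the episode embeddings.
stand_in = rng.normal(size=(236, 384)).astype(np.float32)

pca_check = PCA(n_components=2, random_state=42).fit(stand_in)
# Fraction of total variance captured by the two plotted components.
retained = float(pca_check.explained_variance_ratio_.sum())
print(f"Variance retained by 2 components: {retained:.1%}")
```

A low retained fraction is a hint to treat apparent clusters or outliers in the scatter plot with extra caution.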

## Saving Embeddings and Results

For reproducibility and future analysis, we save the computed embeddings as a NumPy `.npy` file and the episode titles and summaries as a CSV file.

In [10]:
import numpy as np
np.save('friends_episode_embeddings.npy', embeddings)
df[["Episode_Title","Summary"]].to_csv('friends_title_summary.csv', index=False)
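
A later session can reload the saved array with `np.load`. The following round-trip sketch writes to a temporary directory and uses a small random array as a stand-in for the real embeddings (both are assumptions for illustration), then checks that shape, dtype, and values survive.

```python
import tempfile
from pathlib import Path

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "friends_episode_embeddings.npy"
    # Stand-in for the real embeddings: 4 rows of 384 float32 values.
    original = np.random.default_rng(0).normal(size=(4, 384)).astype(np.float32)
    np.save(path, original)
    restored = np.load(path)  # loaded fully into memory before the directory is removed

round_trip_ok = (
    restored.shape == original.shape
    and restored.dtype == original.dtype
    and np.array_equal(restored, original)
)
print(round_trip_ok)
```

Because the CSV is written in the same row order as the embeddings, row `i` of the reloaded array corresponds to row `i` of the saved titles and summaries.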

## Summary

- We embedded FRIENDS episode texts into high-dimensional vectors.
- We used Isolation Forest to detect the weirdest episodes.
- We visualized the results in 2D using PCA, with caveats about dimensionality reduction.
- We saved the embeddings and episode texts for future use.

This workflow demonstrates how modern NLP and unsupervised learning can be combined to explore and analyze pop culture datasets in creative ways!