# CPSC 330 - Applied Machine Learning 

## Homework 9: Communication

<br><br><br><br>

<div class="alert alert-info">
    
## Submission instructions
<hr>
rubric={points:2}

Follow the [homework submission instructions](https://github.com/UBC-CS/cpsc330-2024W1/blob/main/docs/homework_instructions.md). 

**You may work in a group on this homework and submit your assignment as a group.** Below are some instructions on working as a group.  
- The maximum group size is 2. 
- Use group work as an opportunity to collaborate and learn new things from each other. 
- Be respectful to each other and make sure you understand all the concepts in the assignment well. 
- It's your responsibility to make sure that the assignment is submitted by one of the group members before the deadline. 
- You can find the instructions on how to do group submission on Gradescope [here](https://help.gradescope.com/article/m5qz2xsnjy-student-add-group-members).


When you are ready to submit your assignment do the following:

1. Run all cells in your notebook to make sure there are no errors by doing `Kernel -> Restart Kernel and Clear All Outputs` and then `Run -> Run All Cells`. 
2. Notebooks with cell execution numbers out of order or not starting from “1” will have marks deducted. Notebooks without the output displayed may not be graded at all (because we need to see the output in order to grade your work).
3. Upload the assignment using Gradescope's drag and drop tool. Check out this [Gradescope Student Guide](https://lthub.ubc.ca/guides/gradescope-student-guide/) if you need help with Gradescope submission.
4. Make sure that the plots and output are rendered properly in your submitted file. 
5. If the .ipynb file is too big and doesn't render on Gradescope, also upload a pdf or html in addition to the .ipynb.

<br><br><br><br>

<br><br><br><br>

## Exercise 1: Communication
<hr>

### 1.1 Blog post 
rubric={points:26}

Write up your analysis from hw5 or any other assignment or your side project on machine learning in a "blog post" or report format. It's fine if you just write it here in this notebook. Alternatively, you can publish your blog post publicly and include a link here. (See exercise 1.3.) The target audience for your blog post is someone like yourself right before you took this course. They don't necessarily have ML knowledge, but they have a solid foundation in technical matters. The post should focus on explaining **your results and what you did** in a way that's understandable to such a person, **not** a lesson trying to teach someone about machine learning. Again: focus on the results and why they are interesting; avoid pedagogical content.

Your post must include the following elements (not necessarily in this order):

- Description of the problem/decision.
- Description of the dataset (the raw data and/or some EDA).
- Description of the model.
- Description of your results, both quantitatively and qualitatively. Make sure to refer to the original problem/decision.
- A section on caveats, describing at least 3 reasons why your results might be incorrect, misleading, overconfident, or otherwise problematic. Make reference to your specific dataset, model, approach, etc. To check that your reasons are specific enough, make sure they would not make sense, if left unchanged, to most students' submissions; for example, do not just say "overfitting" without explaining why you might be worried about overfitting in your specific case.
- At least 3 visualizations. These visualizations must be embedded/interwoven into the text, not pasted at the end. The text must refer directly to each visualization. For example "as shown below" or "the figure demonstrates" or "take a look at Figure 1", etc. It is **not** sufficient to put a visualization in without referring to it directly.

A reasonable length for your entire post would be **800 words**. The maximum allowed is **1000 words**.

#### Example blog posts

Here are some examples of applied ML blog posts that you may find useful as inspiration. The target audiences of these posts aren't necessarily the same as yours, and these posts are longer than yours, but they are well-structured and engaging. You are **not required to read these** posts as part of this assignment - they are here only as examples if you'd find that useful.

From the UBC Master of Data Science blog, written by a past student:

- https://ubc-mds.github.io/2019-07-26-predicting-customer-probabilities/

This next one uses R instead of Python, but that might be good in a way, as you can see what it's like for a reader that doesn't understand the code itself (the target audience for your post here):

- https://rpubs.com/RosieB/taylorswiftlyricanalysis

Finally, here are a couple interviews with winners from Kaggle competitions. The format isn't quite the same as a blog post, but you might find them interesting/relevant:

- https://medium.com/kaggle-blog/instacart-market-basket-analysis-feda2700cded
- https://medium.com/kaggle-blog/winner-interview-with-shivam-bansal-data-science-for-good-challenge-city-of-los-angeles-3294c0ed1fb2


#### A note on plagiarism

You may **NOT** include text or visualizations that were not written/created by you. If you are in any doubt as to what constitutes plagiarism, please just ask. For more information see the [UBC Academic Misconduct policies](http://www.calendar.ubc.ca/vancouver/index.cfm?tree=3,54,111,959). Please don't copy this from somewhere or ask Generative AI to write it for you 🙏. 

<br><br>

#### Turning Meaningless Text into Meaningful Text for Machine Learning Models: Clustering and Unsupervised Learning

By Nathan Sihombing – June 2025

(Based on Homework 6 – CPSC 330: Unsupervised Learning and Clustering)

##### The Setup and Background

This blog post reflects on what I did and learned during Homework 6 of CPSC 330. The goal of the assignment was to practice unsupervised learning on text data. Much of the code and structure was scaffolded for us, which is why I often say "we". But within that structure, there was room to reflect on what worked, what didn’t, and how different representations and models affected the outcome.

We focused on two datasets: Wikipedia sentences and recipe names. In both cases, the task was the same: try to find meaningful clusters in unlabeled text data, using a combination of sentence embeddings and different clustering algorithms. This is very different from supervised learning, where the data has already been labelled. Text data often comes without labels, and in those situations unsupervised learning is the natural tool.


##### The Datasets

The first dataset was a small collection of Wikipedia sentences. They were short and mixed, covering topics like history, science, politics, music, and food. The second dataset was a longer list of recipe titles, which were often very short and lacked meaningful context.

To work with the text, we used sentence embeddings: each sentence or recipe title was represented as a 384-dimensional vector that roughly captures what it is about. Since the embeddings were given to us, we didn’t have to worry about feature engineering and could jump straight into clustering. Reusing existing embeddings like this is a good way to save time on larger projects.
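To give a feel for what it means for a vector to capture what a text is about, here is a minimal sketch with made-up 3-dimensional vectors standing in for the real 384-dimensional embeddings. The titles, the numbers, and the helper `cosine_similarity` are all hypothetical and not taken from the assignment. The only point is that similar items should end up pointing in similar directions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy "embeddings"; real sentence embeddings had 384 dimensions.
embeddings = {
    "chocolate cake": [0.9, 0.1, 0.0],
    "vanilla cupcakes": [0.8, 0.2, 0.1],
    "beef stew": [0.1, 0.9, 0.2],
}

sim_desserts = cosine_similarity(embeddings["chocolate cake"], embeddings["vanilla cupcakes"])
sim_mixed = cosine_similarity(embeddings["chocolate cake"], embeddings["beef stew"])

print(f"cake vs cupcakes: {sim_desserts:.3f}")
print(f"cake vs stew:     {sim_mixed:.3f}")
```

In this toy setup the two desserts score much higher with each other than with the stew, which is the kind of structure a clustering algorithm can pick up on.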

##### Manual Grouping

We started by manually grouping a short list of Wikipedia sentences into clusters, as a warm-up to get a feel for how things might be grouped. I came up with four groups: AI/ML, quantum computing, environmental science, and food. Looking through your data by hand is a good habit: if you understand your data, you will know what to look for.

##### KMeans with Bag-of-Words

We first tried KMeans using a basic bag-of-words (BoW) representation. The results weren’t great: the clusters felt random and were hard to interpret. In hindsight this is not surprising, because BoW strips away context and word order, so it captures very little meaning.
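The loss of word order is easy to see in a few lines of plain Python. This is only an illustrative sketch, not the code from the assignment: the two sentences and the helper `bag_of_words` are made up for the example.

```python
from collections import Counter


def bag_of_words(sentence, vocabulary):
    """Count how often each vocabulary word appears in the sentence."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]


# Two made-up sentences with the same words but opposite meanings.
s1 = "the dog chased the cat"
s2 = "the cat chased the dog"

vocabulary = sorted(set(s1.split()) | set(s2.split()))
v1 = bag_of_words(s1, vocabulary)
v2 = bag_of_words(s2, vocabulary)

print(vocabulary)   # ['cat', 'chased', 'dog', 'the']
print(v1)           # [1, 1, 1, 2]
print(v2)           # [1, 1, 1, 2]
print(v1 == v2)     # True
```

Both sentences get exactly the same vector even though they describe opposite events, so a clustering algorithm working on these counts cannot tell them apart.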
##### Sentence Embeddings with KMeans

We then reran KMeans, but this time using sentence embeddings. That changed everything. Now some clusters were clearly about military history, while others were about music or science. It was the first time I saw how much difference a better representation can really make.

##### DBSCAN

Next, we tried DBSCAN. It was better at carving out smaller, tighter clusters and marking the remaining points as noise. It didn’t need us to pick k, which was nice, but tuning eps took some work. The noise label was especially useful here, because it keeps outliers from skewing the clusters.

##### Hierarchical Clustering

We applied hierarchical clustering using cosine distance. The dendrogram helped visualize how sentences grouped at different levels. Some links felt random, but some made sense.

In [None]:
from IPython.display import Image

Image("img/m2.png")

The dendrogram above is a very helpful tool for deciding where to cut, because it shows the distance at which each data point merges with other data points. It makes hierarchical clustering very interpretable, which honestly makes it my favourite type of model.

To wrap up the Wikipedia sentence clustering section, I used UMAP to project the high-dimensional embeddings into 2D. Then I visualized the clusters from different methods to see how things looked spatially.

Here’s what I saw:
Bag-of-Words + KMeans looked pretty scattered. Most of the points ended up in a single cluster, and the rest felt randomly assigned. It didn’t really separate meaningful topics, which confirmed that BoW just isn’t good enough for this kind of data.

In [None]:
from IPython.display import Image

Image("img/m1.png")


Embeddings + KMeans looked a lot better. The clusters were way more balanced, and I could already see groupings that matched themes like music, politics, or history.


Embeddings + DBSCAN picked out a few really tight clusters and labeled a bunch of points as noise (-1). That actually made sense, as some sentences were too vague or didn’t clearly belong to any topic.


Embeddings + Hierarchical gave a nice spread. It wasn’t as sharp as DBSCAN, but still formed some logical groupings. Some clusters were mixed, but overall, the separation was decent.


What stood out the most was how switching from bag-of-words to sentence embeddings completely changed the structure. Just by changing the representation, the clusters went from random to meaningful.

This visual check really helped solidify that the embedding space captured actual topic similarity and that clustering methods like DBSCAN and hierarchical clustering could work well if the representation was good to begin with.

For the second dataset, the recipe names, we looked at the shortest and longest titles, then generated a new set of sentence embeddings for all of them. Looking at the titles helped me realize how vague and minimal many of them were, which probably made it harder for models to form clear clusters. They were also a lot shorter than the Wikipedia sentences, so clustering them felt more challenging from the start.

In [None]:
from IPython.display import Image

Image("img/m4.png")

For KMeans, I tested values of k (the number of clusters) from 2 to 10. For each one I calculated both the inertia (for the elbow plot) and the silhouette score to see which value of k made the most sense.
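A sweep like this can be sketched with scikit-learn as follows. This is not the assignment's code: the data here is a synthetic stand-in (three well-separated 2-D blobs generated with NumPy) instead of the real 384-dimensional recipe embeddings, and the values of `n_init` and `random_state` are assumptions chosen for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the embeddings: three well-separated 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

results = {}
for k in range(2, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = model.fit_predict(X)
    results[k] = {
        "inertia": model.inertia_,                  # within-cluster sum of squares
        "silhouette": silhouette_score(X, labels),  # in [-1, 1], higher is better
    }

for k, scores in results.items():
    print(f"k={k:2d}  inertia={scores['inertia']:9.2f}  silhouette={scores['silhouette']:.3f}")

# The k with the highest silhouette score is one candidate for the "best" k.
best_k = max(results, key=lambda k: results[k]["silhouette"])
print("k with the highest silhouette score:", best_k)
```

On tidy toy data like this, the silhouette score has a clear peak at the true number of blobs. Real text embeddings rarely behave that nicely, so the numbers are best read together with a manual look at the clusters.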


In [None]:
from IPython.display import Image

Image("img/m3.png")



The elbow plot didn’t show a super clear "elbow", but the curve started to flatten around k = 10, so that seemed like a reasonable choice. The silhouette score peaked at k = 2 (~0.046), dipped around k = 4, and then slowly increased again, ending at ~0.026 for k = 10.

Even though that wasn’t the highest score, I went with k = 10 because the clusters were more interpretable and better aligned with real themes, like desserts, soups, drinks, and meat dishes. For example, chicken recipes, salads, and soups each ended up in their own group, which reinforced that the choice made sense.

In hindsight, I think higher values of k might have worked too, since some recipe categories can be really specific. But at the same time, I didn’t want to risk over-splitting the data when k = 10 gave me solid results.

##### DBSCAN

I experimented with a few different values of `eps` and `min_samples` to see what would work best for DBSCAN using cosine distance. After testing, I settled on `eps = 0.3` and `min_samples = 5`, which gave me 18 clusters along with a few noise points.

The silhouette score came out slightly negative at -0.1277, which usually means the clusters are not tight or well separated. But when I looked at the clusters manually, they still made sense, which showed me that scores don’t always reflect the full picture, especially with short text.
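For readers curious what such a run looks like, here is a minimal sketch using scikit-learn with the same `eps`, `min_samples`, and cosine metric. The data is made up (unit vectors at hand-picked angles, via a hypothetical helper `unit_vectors`) so that the outcome is easy to reason about. Excluding noise points before computing the silhouette score is one possible choice made for this sketch, not necessarily how the score above was computed.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score


def unit_vectors(angles_deg):
    """Turn angles (in degrees) into 2-D unit vectors."""
    radians = np.deg2rad(angles_deg)
    return np.column_stack([np.cos(radians), np.sin(radians)])


# Made-up data: two tight groups of directions plus one point far from both.
X = unit_vectors([0, 2, 4, 6, 8, 10, 90, 92, 94, 96, 98, 100, 225])

labels = DBSCAN(eps=0.3, min_samples=5, metric="cosine").fit_predict(X)

n_clusters = len(set(labels) - {-1})  # -1 is DBSCAN's label for noise
n_noise = int(np.sum(labels == -1))
print("labels:", labels)
print("clusters:", n_clusters, "noise points:", n_noise)

# Silhouette score on the clustered points only (noise excluded in this sketch).
clustered = labels != -1
score = silhouette_score(X[clustered], labels[clustered], metric="cosine")
print("silhouette (cosine):", round(float(score), 3))
```

The lone point at 225 degrees is too far from everything else to join a cluster, so it gets the noise label -1 instead of being forced into a group.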
Some specific groupings that DBSCAN captured well included:

- Cabbage dishes
- Pilafs
- Cocktails
- Pork medallions
- Meatloaf

Because DBSCAN is good at finding small, dense clusters, it fit this dataset well: a lot of recipes fall naturally into specific categories that don’t need a large cluster to be meaningful. So while the silhouette score didn’t look good on paper, the clusters themselves were sharp and well-defined, especially compared to KMeans, which tended to group broader themes.
##### Hierarchical Clustering

For hierarchical clustering, I used the sentence embeddings and applied linkage with `method="complete"` and `metric="cosine"`. I first tried cutting the dendrogram with a distance-based cutoff (`t=0.97`, `criterion="distance"`), but I found it hard to choose a good threshold this way. Small changes in the distance resulted in big jumps in the number of clusters, and it wasn’t obvious where to cut based on the dendrogram’s shape.

Because of that, I switched to `criterion="maxclust"` with a maximum of 17 clusters, which let me specify the number of clusters directly. This was easier to tune and gave me a more stable and interpretable result. The clusters I got mostly lined up with the themes I saw in the KMeans and DBSCAN results.
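The difference between the two ways of cutting the tree can be sketched with SciPy's `linkage` and `fcluster`. This is an illustration on made-up data (unit vectors at hand-picked angles, via a hypothetical helper `unit_vectors`), not the recipe embeddings. The thresholds 0.5 and 1.5 and the cluster count 3 are chosen for the toy example only.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def unit_vectors(angles_deg):
    """Turn angles (in degrees) into 2-D unit vectors."""
    radians = np.deg2rad(angles_deg)
    return np.column_stack([np.cos(radians), np.sin(radians)])


# Made-up data: three groups of directions.
X = unit_vectors([0, 5, 10, 80, 85, 90, 200, 205, 210])

# Complete linkage on cosine distances, as described in the text.
Z = linkage(X, method="complete", metric="cosine")

# Option 1: cut at a distance threshold; the number of clusters depends on t.
n_clusters_by_t = {}
for t in (0.5, 1.5):
    labels = fcluster(Z, t=t, criterion="distance")
    n_clusters_by_t[t] = len(set(labels))
    print(f"distance cutoff t={t}: {n_clusters_by_t[t]} clusters")

# Option 2: ask for at most a given number of clusters directly.
labels_max = fcluster(Z, t=3, criterion="maxclust")
print("maxclust with t=3:", labels_max)
```

With a distance cutoff, the number of clusters is a side effect of the threshold, while `criterion="maxclust"` lets you state the number you want and leaves the threshold to SciPy.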
Some of the dominant themes in the clusters included:

- Drinks
- Baked goods
- Meat dishes
- Seafood
- Salads
- Weight Watchers recipes


However, since recipe names are often short and vague, not all clusters were clean. For example, I had a few clusters that mixed together drinks and desserts, or grouped appetizers with unrelated dishes. This makes sense when working with short text: sometimes the name alone doesn’t give the model enough to separate items perfectly.

Overall, hierarchical clustering gave me reasonably meaningful groupings and helped reinforce what I learned from the other clustering methods, especially in terms of which themes consistently showed up across approaches.

Finally, we were asked to label the clusters ourselves. This part was kind of subjective: some clusters were easy to name (e.g., desserts), while others were a mix. But it was a good reminder that interpretation still matters even in "automatic" models.


I learned to trust visuals and manual inspection just as much as numbers, especially when working with messy or ambiguous data like recipe names. I think this applies even more to unsupervised learning, where there are no labels to check the results against.

##### Limitations

- No labels means no “accuracy”: evaluation was largely subjective.
- Short text is hard to interpret: even embeddings have limits.
- UMAP is useful but can be misleading: it is great for plots, but distances in the 2D projection are not reliable, so you have to know what you are looking at.
- Model sensitivity: especially with DBSCAN, the smallest parameter changes led to big shifts in the clusters.

This assignment gave me more than practice: it changed how I think about data analysis. It showed that unsupervised learning can reveal real structure, even in text. Most importantly, it reminded me that features matter more than models. You can have the best algorithm, but if the representation is bad, the result will be too. So, as an ML scientist, following the process is just as important as picking the correct model.

I didn’t go into this expecting to enjoy clustering, but finding meaningful patterns in unstructured data turned out to be satisfying. Watching the data sort itself out was both useful and fun.


### 1.2 Effective communication technique
rubric={points:4}

Describe one effective communication technique that you used in your post, or an aspect of the post that you are particularly satisfied with. (Max 3 sentences.)

I really liked the plots I used, and I think they help the reader visualize the ideas. Another thing I believe I did well was using plain language that a new machine learning student can understand. Explaining your thought process efficiently can be hard sometimes.

<br><br>

### (optional, not for marks) 1.3

Publish your blog post from 1.1 publicly using a tool like [Quarto](https://quarto.org/), or somewhere like medium.com, and paste a link here. Be sure to pick a tool in which code and code output look reasonable. This link could be a useful line on your resume!

<br><br><br><br>

## Exercise 2: Your takeaway from the course 
rubric={points:2}

**Your tasks:**

- Reflect on your journey through this course. Please identify and elaborate on at least three key concepts or experiences where you had an "aha" moment. How would you use the concepts learned in this course in your personal projects or how would you approach your past projects differently based on the insights gained in this course? We encourage you to dig deep and share your genuine reflections.

> Please write thoughtful answers. We are looking forward to reading them 🙂. 

<br><br><br><br>

**Before submitting your assignment, please make sure you have followed all the instructions in the Submission instructions section at the top.** 

### Congratulations 👏👏

That's all for the assignments! Congratulations on finishing all homework assignments! 

In [None]:
from IPython.display import Image

Image("img/eva-congrats.png")