# CPSC 330 - Applied Machine Learning 

## Homework 9: Communication

**Due date: See the [Calendar](https://htmlpreview.github.io/?https://github.com/UBC-CS/cpsc330/blob/master/docs/calendar.html)**

<br><br><br><br>

<div class="alert alert-info">
    
## Submission instructions
<hr>
rubric={points:2}

Follow the [homework submission instructions](https://github.com/UBC-CS/cpsc330-2024W1/blob/main/docs/homework_instructions.md). 

**You may work in a group on this homework and submit your assignment as a group.** Below are some instructions on working as a group.  
- The maximum group size is 2. 
- Use group work as an opportunity to collaborate and learn new things from each other. 
- Be respectful to each other and make sure you understand all the concepts in the assignment well. 
- It's your responsibility to make sure that the assignment is submitted by one of the group members before the deadline. 
- You can find the instructions on how to do group submission on Gradescope [here](https://help.gradescope.com/article/m5qz2xsnjy-student-add-group-members).


When you are ready to submit your assignment do the following:

1. Run all cells in your notebook to make sure there are no errors by doing `Kernel -> Restart Kernel and Clear All Outputs` and then `Run -> Run All Cells`. 
2. Notebooks with cell execution numbers out of order or not starting from “1” will have marks deducted. Notebooks without the output displayed may not be graded at all (because we need to see the output in order to grade your work).
3. Upload the assignment using Gradescope's drag and drop tool. Check out this [Gradescope Student Guide](https://lthub.ubc.ca/guides/gradescope-student-guide/) if you need help with Gradescope submission.
4. Make sure that the plots and output are rendered properly in your submitted file. 
5. If the .ipynb file is too big and doesn't render on Gradescope, also upload a pdf or html in addition to the .ipynb.

<br><br><br><br>

## Exercise 1: Survival analysis
<hr>

rubric={points:6}

The following questions pertain to Lecture 21 on survival analysis. We'll consider the use case of customer churn analysis.

1. What is the problem with simply labeling customers as "churned" or "not churned" and using standard supervised learning techniques?
2. Consider customer A who just joined last week vs. customer B who has been with the service for a year. Who do you expect will leave the service first: probably customer A, probably customer B, or we don't have enough information to answer? Briefly explain your answer. 
3. If a customer's survival function is almost flat during a certain period, how do we interpret that?

1. A binary "churned" / "not churned" label discards the timing of churn, which is what matters for customer retention: we want to know when a customer is likely to churn so that retention techniques can be applied before the churn event occurs. Standard supervised learning also requires us to know whether and when each customer churned, which is rarely the case in the real world. Customers who are still subscribed may churn in the future, so we are forced either to ignore them or to represent them inaccurately, which leaves a biased dataset dominated by those who churned within the data collection window.
2. Most likely customer A. Customers often leave early on, for example once a free trial ends and before they feel committed to the product. However, this varies by industry and product, since churn patterns can look different in different sectors. 
3. A nearly flat survival function means the probability of still being a customer barely drops over that period, so the customer is unlikely to churn during it. 
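
To make the censoring point in answer 1 concrete, here is a minimal sketch of a Kaplan-Meier survival estimate written from scratch. The toy tenure data and the helper name `kaplan_meier` are assumptions made for illustration and do not come from the lecture; in practice a library such as `lifelines` would normally be used instead.

```python
def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct time where churn was observed."""
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # Everyone whose tenure reaches t is still "at risk" at time t,
        # including customers who are censored (still subscribed) later on.
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical toy data: tenure in months and whether churn was observed.
# observed=False means the customer was still subscribed when data collection ended.
durations = [2, 3, 3, 5, 8, 8, 12, 12]
observed = [True, True, False, True, False, True, False, False]

curve = kaplan_meier(durations, observed)
for t, s in curve:
    print(f"month {t}: estimated survival probability {s:.3f}")
```

Customers who are still subscribed stay in the at-risk count up to their last observed month instead of being dropped or mislabeled. The estimate only drops at months where churn was observed, so a stretch with no drops corresponds to the flat survival function described in question 3.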

<br><br><br><br>

## Exercise 2: Communication
<hr>

### 2.1 Blog post 
rubric={points:26}

Write up your analysis from hw5 or any other assignment or your side project on machine learning in a "blog post" or report format. It's fine if you just write it here in this notebook. Alternatively, you can publish your blog post publicly and include a link here. (See exercise 2.3.) The target audience for your blog post is someone like yourself right before you took this course. They don't necessarily have ML knowledge, but they have a solid foundation in technical matters. The post should focus on explaining **your results and what you did** in a way that's understandable to such a person, **not** a lesson trying to teach someone about machine learning. Again: focus on the results and why they are interesting; avoid pedagogical content.

Your post must include the following elements (not necessarily in this order):

- Description of the problem/decision.
- Description of the dataset (the raw data and/or some EDA).
- Description of the model.
- Description of your results, both quantitatively and qualitatively. Make sure to refer to the original problem/decision.
- A section on caveats, describing at least 3 reasons why your results might be incorrect, misleading, overconfident, or otherwise problematic. Make reference to your specific dataset, model, approach, etc. To check that your reasons are specific enough, make sure they would not make sense, if left unchanged, to most students' submissions; for example, do not just say "overfitting" without explaining why you might be worried about overfitting in your specific case.
- At least 3 visualizations. These visualizations must be embedded/interwoven into the text, not pasted at the end. The text must refer directly to each visualization. For example "as shown below" or "the figure demonstrates" or "take a look at Figure 1", etc. It is **not** sufficient to put a visualization in without referring to it directly.

A reasonable length for your entire post would be **800 words**. The maximum allowed is **1000 words**.

#### Example blog posts

Here are some examples of applied ML blog posts that you may find useful as inspiration. The target audiences of these posts aren't necessarily the same as yours, and these posts are longer than yours, but they are well-structured and engaging. You are **not required to read these** posts as part of this assignment - they are here only as examples if you'd find that useful.

From the UBC Master of Data Science blog, written by a past student:

- https://ubc-mds.github.io/2019-07-26-predicting-customer-probabilities/

This next one uses R instead of Python, but that might be good in a way, as you can see what it's like for a reader that doesn't understand the code itself (the target audience for your post here):

- https://rpubs.com/RosieB/taylorswiftlyricanalysis

Finally, here are a couple interviews with winners from Kaggle competitions. The format isn't quite the same as a blog post, but you might find them interesting/relevant:

- https://medium.com/kaggle-blog/instacart-market-basket-analysis-feda2700cded
- https://medium.com/kaggle-blog/winner-interview-with-shivam-bansal-data-science-for-good-challenge-city-of-los-angeles-3294c0ed1fb2


#### A note on plagiarism

You may **NOT** include text or visualizations that were not written/created by you. If you are in any doubt as to what constitutes plagiarism, please just ask. For more information see the [UBC Academic Misconduct policies](http://www.calendar.ubc.ca/vancouver/index.cfm?tree=3,54,111,959). Please don't copy this from somewhere or ask Generative AI to write it for you 🙏. 

# Biases in Machine Learning  

In the field of natural language processing (NLP), it is very common to use pre-trained embeddings as they have been shown to perform well on many text classification tasks. During my applied machine learning class at UBC, I had the opportunity to explore how these pre-trained embeddings integrate into NLP systems and examine the biases that inevitably become embedded within the models. In this post, I will discuss findings from Assignment 7 of the course, where I analyzed biases in the GloVe model trained on Wikipedia text.  

The GloVe algorithm, similar to Word2Vec, creates vector representations for words based on semantic similarity, which is influenced by word co-occurrence and proximity in the training data. These embeddings capture nuanced relationships between words but also reflect the stereotypes and biases inherent in the text.  
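
The analogy queries in this post rest on simple vector arithmetic. The sketch below shows the idea with hand-made 3-dimensional vectors: the vectors, the neutral example words, and the function names `cosine` and `analogy` are all assumptions for illustration. Real GloVe vectors have far more dimensions, and the analogy function from the assignment may differ in its details.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(word1, word2, word3, vectors, topn=3):
    """word1 is to word2 as word3 is to ...? Rank the remaining words."""
    target = [
        b - a + c
        for a, b, c in zip(vectors[word1], vectors[word2], vectors[word3])
    ]
    scores = [
        (word, cosine(target, vec))
        for word, vec in vectors.items()
        if word not in {word1, word2, word3}  # skip the query words themselves
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy embeddings, made up to show the mechanics only.
toy_vectors = {
    "man": [1.0, 0.0, 0.2],
    "woman": [-1.0, 0.0, 0.2],
    "king": [1.0, 1.0, 0.1],
    "queen": [-1.0, 1.0, 0.1],
    "princess": [-1.0, 0.7, 0.0],
    "apple": [0.0, 0.0, 1.0],
}

results = analogy("man", "king", "woman", toy_vectors)
for word, score in results:
    print(f"{word}: {score:.3f}")
```

The scores reported in the figures of this post are similarity scores of this kind: the closer a candidate word lies to the target vector, the higher it ranks.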

---

## **Analysis and Results**  
An analogy function was built on top of the GloVe model: given a word and an associated word, it returns the words that relate to a third word in the same way. For the first example, I fed the analogy function the words "man", "doctor", and "woman". In other words: if "man" is associated with "doctor", which word is associated with "woman" in the same way? <br>
<br>
The results are shown in the figure below: <br>
<img src=https://raw.githubusercontent.com/lsheresa/hw9/main/img/im1.png width="200" height="200">

<br>
<br>
As the results show, the word with the highest score was "nurse" at 0.773523, noticeably higher than the runner-up, "physician", at 0.718943. "Physician" is technically closer in meaning to "doctor" than "nurse" is, since all physicians are doctors while a nurse would not be considered a doctor. The model nevertheless ranked "nurse" as the better match for "woman", which points to a gender bias in the model. <br>
<br>
Next, I fed the analogy function the words "doctor", "intelligent", and "nurse" to further examine how the model perceives women, since it had just strongly associated "woman" with "nurse" when asked to draw a parallel to "man" and "doctor". 
<br>
<br>
The results can be seen in the figure below: <br>
<img src=https://raw.githubusercontent.com/lsheresa/hw9/main/img/im3.png width="300" height="300"> <br>
<br>
As the results show, the analogy words for "nurse" had much lower scores this time, and none of them had anything to do with intelligence, even though "intelligent" was the word paired with "doctor". Instead, the model associated "nurse" with words about work ethic and personality. This highlights a bias in broader society: people in jobs more often held by women tend to be seen as "hard workers", or are credited for positive personality traits, before they are seen as intelligent. 
<br>
<br>

Lastly, I fed the analogy function the words "man", "smart", and "woman" to examine which words the model links to women when asked to draw a parallel to the association between "man" and "smart". 
<br>
<br>
The results can be seen in the figure below: <br>
<img src=https://raw.githubusercontent.com/lsheresa/hw9/main/img/im2.png width="200" height="200">
<br>
As the results show, the word with the highest score was "intelligent" (0.654885), which is nearly synonymous with "smart". However, many of the other analogy words refer to a person's physical appearance: "sexy", "cute", "pretty", and "attractive" all showed up, none of which has any real link to intelligence. This example again reflects a bias in broader society, where attention drifts to attributes unrelated to intelligence even when a woman's intelligence is the topic.
<br>
<br>
### **Biases in Pre-Trained Embeddings**  
Pre-trained embeddings, by design, mirror the patterns in their training text. This can be beneficial for capturing linguistic subtleties but also means that biases, such as gender stereotypes, are integrated into the model.  

For instance, during the analysis, I observed that certain professions were aligned with specific genders in the embedding space. Using the vector difference between words like "man" and "woman," I identified a "gender axis" along which other terms could be projected. Professions such as "nurse" leaned heavily towards "woman," while "doctor" skewed towards "man." This pattern was quantified and visualized to highlight how these biases manifested.
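
As a rough sketch of how such a projection can be computed, the snippet below builds an axis from the difference between two word vectors and measures how far other vectors lean along it. The 2-dimensional vectors, the placeholder names `word_a`, `word_b`, `word_c`, and the helper `axis_projection` are hypothetical and chosen only to show the arithmetic; they are not values taken from GloVe.

```python
import math


def axis_projection(word_vec, start_vec, end_vec):
    """Signed length of word_vec along the axis pointing from start_vec to end_vec."""
    axis = [e - s for s, e in zip(start_vec, end_vec)]
    axis_norm = math.sqrt(sum(a * a for a in axis))
    return sum(w * a for w, a in zip(word_vec, axis)) / axis_norm


# Hypothetical toy vectors, made up for illustration.
man = [1.0, 0.2]
woman = [-1.0, 0.2]
toy_words = {
    "word_a": [-0.6, 0.9],  # leans toward the "woman" end of the axis
    "word_b": [0.4, 0.9],   # leans toward the "man" end of the axis
    "word_c": [0.0, 1.0],   # sits in the middle
}

projections = {
    name: axis_projection(vec, man, woman) for name, vec in toy_words.items()
}
for name, value in projections.items():
    print(f"{name}: {value:+.2f}")
```

Positive values lean toward the second anchor word and negative values toward the first, so sorting words by this number gives one simple way to compare how strongly they align with either end.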

### **Limitations and Challenges**  
There are several reasons why the results from this analysis could be misleading or problematic:  
1. **Confirmation Bias:**  
   Before working with the embeddings, I was already aware of examples demonstrating biases in similar models, as discussed in course lectures. This prior knowledge may have led me to subconsciously seek examples of bias that aligned with my expectations, possibly ignoring counterexamples.  

2. **Limited Dataset Scope:**  
   The embeddings used for this assignment were trained on a subset of Wikipedia text. While Wikipedia is a broad knowledge source, the specific subset might not accurately reflect biases present in the full corpus or in other text corpora. This raises concerns about the generalizability of the observed biases.  

3. **Algorithmic Influence:**  
   The GloVe algorithm works by creating a word co-occurrence matrix to capture global co-occurrence statistics. Depending on the text's structure, certain words may co-occur more frequently by chance or necessity, leading the algorithm to overemphasize relationships that fuel biases. This effect is exacerbated when the text contains systematic patterns of bias.  
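
The co-occurrence idea in point 3 can be sketched as follows. This is a simplified illustration and not the GloVe implementation: it only counts how often two words appear within a fixed window of each other, whereas GloVe goes on to fit word vectors to statistics of this kind. The toy sentences, the `window` size, and the function name `cooccurrence_counts` are assumptions made for this example.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered word pair appears within `window` tokens."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


# Hypothetical toy corpus, written to show how phrasing drives the counts.
toy_corpus = [
    "the nurse said she was tired",
    "the doctor said he was late",
    "the nurse said she was early",
]

counts = cooccurrence_counts(toy_corpus)
print(counts[("nurse", "she")])
print(counts[("doctor", "he")])
print(counts[("nurse", "he")])
```

If a corpus systematically pairs certain words in this way, the counts record that pattern, and any vectors fitted to those counts inherit it.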

---

## **Conclusion**  
Understanding and addressing biases in pre-trained embeddings is crucial for creating fair and equitable NLP systems. While embeddings like GloVe provide significant advantages in terms of semantic representation, they also amplify stereotypes present in the data. By critically evaluating and mitigating these biases, we can build more robust and inclusive machine learning systems.


### 2.2 Effective communication technique
rubric={points:4}

Describe one effective communication technique that you used in your post, or an aspect of the post that you are particularly satisfied with. (Max 3 sentences.)

One aspect of the post that I was particularly satisfied with was my ability to explain the problem and my results in a way that is easy for my target audience to understand. I think the blog post flows well and is easy to follow, which makes it a breeze to read. 

<br><br>

### (optional, not for marks) 2.3

Publish your blog post from 2.1 publicly using a tool like [Quarto](https://quarto.org/), or somewhere like medium.com, and paste a link here. Be sure to pick a tool in which code and code output look reasonable. This link could be a useful line on your resume!

<br><br><br><br>

## Exercise 3: Your takeaway from the course 
rubric={points:2}

**Your tasks:**

- Reflect on your journey through this course. Please identify and elaborate on at least three key concepts or experiences where you had an "aha" moment. How would you use the concepts learned in this course in your personal projects or how would you approach your past projects differently based on the insights gained in this course? We encourage you to dig deep and share your genuine reflections.

> Please write thoughtful answers. We are looking forward to reading them 🙂. 



Reflecting on my journey through this course, I've encountered several "aha" moments that significantly shaped my understanding of machine learning and data science. Here are three key concepts or experiences and how they influence my approach to projects:

---

## **1. Hyperparameter Optimization and Overfitting**  
### **Aha Moment:**  
The lecture on hyperparameter optimization helped me deeply grasp the pitfalls of overfitting during validation. Specifically, the example of mistakenly over-optimizing on a validation set instead of keeping it as a true proxy for unseen data clarified the importance of test sets and cross-validation strategies.  

### **Application:**  
- **In personal projects:**  
  I now use techniques like nested cross-validation for hyperparameter tuning to ensure unbiased evaluation.  
- **In past projects:**  
  For example, while working on classification tasks, I often over-relied on validation accuracy. I would now revisit those projects with a stricter separation of validation and test data to avoid optimistic biases.
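
The pitfall of over-optimizing on a validation set can be illustrated with a small simulation. This is a sketch under artificial assumptions, not an example from the lecture: the labels are coin flips and every hypothetical "hyperparameter setting" is just a random guesser, so any apparent skill on the validation set comes purely from selecting the best of many tries.

```python
import random


def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


rng = random.Random(0)  # fixed seed so the run is repeatable
n_val, n_test, n_candidates = 50, 50, 200

# Labels are coin flips, so no model can truly do better than about 50%.
val_labels = [rng.randint(0, 1) for _ in range(n_val)]
test_labels = [rng.randint(0, 1) for _ in range(n_test)]

# Each hypothetical "hyperparameter setting" is a model that guesses at random.
candidates = []
for _ in range(n_candidates):
    val_preds = [rng.randint(0, 1) for _ in range(n_val)]
    test_preds = [rng.randint(0, 1) for _ in range(n_test)]
    candidates.append(
        (accuracy(val_preds, val_labels), accuracy(test_preds, test_labels))
    )

# Select the setting with the best validation score, as a tuning loop would.
best_val, test_of_best = max(candidates, key=lambda c: c[0])
mean_val = sum(v for v, _ in candidates) / n_candidates

print(f"best validation accuracy: {best_val:.2f}")
print(f"mean validation accuracy: {mean_val:.2f}")
print(f"test accuracy of the selected setting: {test_of_best:.2f}")
```

Because the winner was chosen for its validation score, that score is inflated by the selection itself. The held-out test score of the same setting is the more honest estimate, which is the reason for keeping a test set that is never used for tuning.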

---

## **2. Feature Engineering and Model Interpretation**  
### **Aha Moment:**  
The discussions on feature importance and engineering made me realize how much interpretability impacts model success and decision-making. Understanding techniques like SHAP values and how to engineer domain-specific features gave me new ways to uncover hidden patterns in data.  

### **Application:**  
- **In personal projects:**  
  I’ll leverage tools like SHAP to explain predictions in projects involving sensitive domains, like healthcare.  
- **In past projects:**  
  While working on the Adult Census dataset, I could have engineered additional features and used feature importances to refine my pipeline.  

---

## **3. Evaluation Metrics for Classification**  
### **Aha Moment:**  
The lecture on classification metrics (precision, recall, F1-score, ROC-AUC) illuminated the trade-offs between these metrics and their contextual relevance. For instance, seeing how optimizing for accuracy versus F1-score can lead to different conclusions on imbalanced datasets was enlightening.  

### **Application:**  
- **In personal projects:**  
  For example, when analyzing Donald Trump’s tweets, I’ll focus on metrics like precision and recall (instead of accuracy) to evaluate sentiment classification.  
- **In past projects:**  
  While evaluating a spam detection model, I naively optimized for accuracy. I would now consider precision-recall curves to better reflect real-world performance.
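
A small worked example shows why accuracy can mislead on imbalanced data. The labels and predictions below are made up for illustration (5 positives out of 100 examples), and the metrics are computed by hand from the confusion counts rather than with a library.

```python
# Hypothetical imbalanced data: 5 positive examples and 95 negative examples.
y_true = [1] * 5 + [0] * 95
# The hypothetical model finds 2 of the 5 positives and raises 1 false alarm.
y_pred = [1, 1, 0, 0, 0] + [1] + [0] * 94

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.2f}")
print(f"precision: {precision:.2f}")
print(f"recall:    {recall:.2f}")
print(f"f1:        {f1:.2f}")
```

Accuracy comes out at 0.96 even though the model misses 3 of the 5 positive cases, while recall (0.40) and F1 (0.50) expose the weakness.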

---

  
  **Exploring New Techniques:**  
   Topics like time series and survival analysis introduced me to areas I haven't explored deeply, motivating future experimentation.  

---

Overall, this course has not only enriched my theoretical knowledge but also equipped me with practical tools to approach machine learning projects with greater confidence, precision, and responsibility.


**Before submitting your assignment, please make sure you have followed all the instructions in the Submission instructions section at the top.** 

### Congratulations 👏👏

That's all for the assignments! Congratulations on finishing all homework assignments! 

from IPython.display import Image

