# Problem description / Topic Modelling

We will use a unsupervised machine learning algorithm to categorize Customer Support help center articles - also known as **Topic Modelling**. The problem we are trying to solve is that searching a few hundred help center article has a better user experience when you can use tags to find articles that matter to you. This will also be used for search. 

# Step 1: Gather data


I'll create a function that will download the text from a specific help center that uses Zendesk (A SaaS). We check if the data already exists at `data/articles.csv`, if it does then we don't get the data again. 

Since each help center is related to one specific business, we will only process one companies help center instead of processing many help center. This will allow our Topic Modelling model(s) to be more accurate for each specific business.

In [2]:
import requests
import pandas as pd
import os

def fetch_articles():
    url = "https://support.amboss.com/api/v2/help_center/en-us/articles?per_page=100"
    all_articles = []

    while url:
        response = requests.get(url)
        data = response.json()
        all_articles.extend(data['articles'])
        url = data.get('next_page')

    return all_articles

def save_articles_to_csv(articles, filename):
    df = pd.DataFrame(articles)
    df.to_csv(filename, index=False)

file_path = 'data/articles.csv'

if not os.path.exists(file_path):
    os.makedirs(os.path.dirname(file_path), exist_ok=True)
    articles = fetch_articles()
    save_articles_to_csv(articles, file_path)
    print(f"Articles saved to {file_path}")
else:
    print(f"File {file_path} already exists. Skipping data collection.")


File data/articles.csv already exists. Skipping data collection.


# Step 2 - Identify an Unsupervised Learning Problem

For categorizing / labelling a relatively large amount (100+) customer support help center articles, we will use Topic Modelling (and most likely Latent Dirichlet Allocation) as the unsupervised learning algorithm because we don't have labels on the existing customer support help center articles. In addition, we are an external team, so we won't have the knowledge needed to manually label the data. Some other reasons we will use that unsupervised learning algorithm:
* It can discover latent features well within a collection of documents.
* Topic Modelling works well with unstructured text.
* It can handle new and update help center articles well - since the information on a support help center changes often, we want a scalable solution. 
* It is also flexible, and will be able to find latent features in any new informmation that is provided in the help center articles. 
* It can condense the high dimensionaility of words into a lower dimension topics, which will make it easier to analyze the overall themes.

Using a supervised machine learning algorithm is not ideal because we don't have labelled data and we want this solution to be scalable. We want to be able to handle new topics without needing to re-fit or label our data - otherwise a supervised algorithm will be considered overfit if we don't re-train the model when new topics or themes in the support help center are introduced.

# Step 3 - EDA



• Describe the factors or components that make up the dataset (The "factors" here are
called "features" in the machine learning term. These factors are often columns in the
tabulated data). For each factor, use a box-plot, scatter plot, histogram, etc., to
describe the data distribution as appropriate.
• Describe correlations between different factors of the dataset and justify your
assumption that they are correlated or not correlated. You may use numeric or
qualitative/graphical analysis for this step.
• Determine if any data needs to be transformed. For example, if you're planning on
using an SVM method for prediction, you may need to normalize or scale the data if
there is a considerable difference in the range of the data.
• Using your hypothesis, indicate if it's likely that you should transform data, such as
using a log transform or other transformation of the dataset.
• You should determine if your data has outliers or needs to be cleaned in any way.
Are there missing data values for specific factors? How will you handle the data
cleaning? Will you discard, interpolate or otherwise substitute data values?
• If you believe that specific factors will be more important than others in your
analysis, you should mention which and why. You will use this to confirm your
intuitions in your final write-up.

In [10]:
# First, we'll do some basic checking of the data
# things like column shape, column names, some of the first rows, see if there are any missing values

df = pd.read_csv(file_path)


print("Dataset shape:", df.shape)
print(f"Number of articles: {df.shape[0]}")
print("Column names:")
print(df.columns.tolist())
print("First two rows of the dataset:")
print(df[:2])
print("Missing values:")
print([(column, count) for column, count in df.isnull().sum().items() if count > 0])  # only show columns with missing values

Dataset shape: (165, 26)
Number of articles: 165
Column names:
['id', 'url', 'html_url', 'author_id', 'comments_disabled', 'draft', 'promoted', 'position', 'vote_sum', 'vote_count', 'section_id', 'created_at', 'updated_at', 'name', 'title', 'source_locale', 'locale', 'outdated', 'outdated_locales', 'edited_at', 'user_segment_id', 'permission_group_id', 'content_tag_ids', 'label_names', 'body', 'user_segment_ids']
First two rows of the dataset:
               id                                                url  \
0  23699429949841  https://amboss.zendesk.com/api/v2/help_center/...   
1  16195168657809  https://amboss.zendesk.com/api/v2/help_center/...   

                                            html_url     author_id  \
0  https://support.amboss.com/hc/en-us/articles/2...  403291139191   
1  https://support.amboss.com/hc/en-us/articles/1...  380858881411   

   comments_disabled  draft  promoted  position  vote_sum  vote_count  ...  \
0               True  False     False         

Ok cool, we did some basic EDA. Seems like a lot of columns, so the first thing we should do is try to reduce the number of columns as they will be the features of our unsupervised ML model. It also looks like the `body` column has HTML which would not be helpful for a Topic Modelling algorithm. Though it is possible that HTML with higher importance (like an H1 tag vs an H6 tag) could be more important, I think we would need to use a neural network for the model to train on the difference, and an unsupervised algorithm like LDA would not be able to work well with those nuances.

Even though we will probably end up using the LDA algorithm and use a combination of the title and the body features, we should still perform some kind of EDA in case we decide to use different algorithms that will make use of other features. 

So first EDA procedure we will be to remove columns that will provide no value: 
1. Remove unused columns (columns with nan)
2. Remove columns that are only provide an id
3. Remove columns that only have one unique value 


In [14]:
# first, let's count how many columns we are starting with
print(f"Number of columns before cleaning: {len(df.columns)}")

#drop columns with missing values or empty arrays
df = df.dropna(axis=1)

# remove columns that only provide id if they exist
id_columns = ['id', 'author_id', 'section_id', 'permission_group_id', 'content_tag_ids']
for column in id_columns:
    if column in df.columns:
        df = df.drop(column, axis=1)

# check if all comments are disable, which means the feature / column doesn't provide value 
if 'comments_disabled' in df.columns:
    all_comments_disabled = df[~df['comments_disabled'] == 'True']
    if len(all_comments_disabled) == 0:
        df = df.drop(['comments_disabled'], axis=1)
        print("All comments are disabled. Dropping column.")
    
# Now, let's check if any other columns only have the same value, then drop them too
for column in df.columns:
    if df[column].nunique() == 1:
        print(f"Column {column} has only one unique value of {df[column].iloc[0]}. Dropping it.")
        df = df.drop(column, axis=1)
        
print(f"Number of columns before cleaning: {len(df.columns)}")
print("Columns after cleaning:", df.columns.tolist())
df.head()




Number of columns before cleaning: 13
Number of columns before cleaning: 13
Columns after cleaning: ['url', 'html_url', 'promoted', 'position', 'vote_sum', 'vote_count', 'created_at', 'updated_at', 'name', 'title', 'edited_at', 'label_names', 'body']


Unnamed: 0,url,html_url,promoted,position,vote_sum,vote_count,created_at,updated_at,name,title,edited_at,label_names,body
0,https://amboss.zendesk.com/api/v2/help_center/...,https://support.amboss.com/hc/en-us/articles/2...,False,0,3,3,2024-03-28T12:12:18Z,2024-05-29T15:15:11Z,NEJM Knowledge+ and AMBOSS,NEJM Knowledge+ and AMBOSS,2024-05-29T15:15:07Z,[],"<p class=""wysiwyg-text-align-left""><img src=""h..."
1,https://amboss.zendesk.com/api/v2/help_center/...,https://support.amboss.com/hc/en-us/articles/1...,False,0,-11,33,2023-06-15T14:45:07Z,2024-05-14T18:58:27Z,🤖 Virtual AMBOSS Assistant (Beta),🤖 Virtual AMBOSS Assistant (Beta),2024-05-14T18:58:22Z,[],<p>To provide you with even better support reg...
2,https://amboss.zendesk.com/api/v2/help_center/...,https://support.amboss.com/hc/en-us/articles/1...,False,0,3,5,2023-05-31T11:35:58Z,2024-04-26T13:20:30Z,Program Overview,Program Overview,2024-04-26T13:20:23Z,[],<p>AMBOSS is accredited by the Accreditation C...
3,https://amboss.zendesk.com/api/v2/help_center/...,https://support.amboss.com/hc/en-us/articles/1...,False,0,-14,14,2023-02-01T13:49:56Z,2024-04-29T15:14:04Z,Access to Anki Mobile Support (Beta),Access to Anki Mobile Support (Beta),2023-10-05T14:50:48Z,[],<p><strong>Anki Mobile Support (Beta)</strong>...
4,https://amboss.zendesk.com/api/v2/help_center/...,https://support.amboss.com/hc/en-us/articles/1...,False,0,3,3,2023-01-26T12:59:53Z,2024-04-26T16:19:24Z,Persistent Filters,Persistent Filters,2023-10-06T13:39:25Z,[],"<div class=""p-rich_text_section"">Creating a <s..."


We removed 10 columns -> from 26 to 13, which is a great start to the EDA process. After reviewing the columns and the data now that there are less columns, it's more clear that the two columns that would provide the most value to an unsupervised algorithm that will categorize a help center article is the `title` and `body`. So let's drop all the other columns

In [18]:
columns_to_drop = [column for column in df.columns.to_list() if column not in ['title','body']]
print(f"Dropping columns: {columns_to_drop}")
df = df.drop(columns_to_drop, axis=1)
print("Columns after dropping:", df.columns.tolist())

Dropping columns: ['url', 'html_url', 'promoted', 'position', 'vote_sum', 'vote_count', 'created_at', 'updated_at', 'name', 'edited_at', 'label_names']
Columns after dropping: ['title', 'body']


Ok great - now we only have two columns to look at. We spoke before about removing HTML, and the `body` column has HTML, so let's remove it

In [19]:
from bs4 import BeautifulSoup

def strip_html(text):
    soup = BeautifulSoup(text, 'html.parser')
    return soup.get_text()

df['body'] = df['body'].apply(strip_html)

In [20]:
df.head()

Unnamed: 0,title,body
0,NEJM Knowledge+ and AMBOSS,The New England Journal of Medicine launched N...
1,🤖 Virtual AMBOSS Assistant (Beta),To provide you with even better support regard...
2,Program Overview,AMBOSS is accredited by the Accreditation Coun...
3,Access to Anki Mobile Support (Beta),Anki Mobile Support (Beta) has been rolled out...
4,Persistent Filters,Creating a Custom Qbank session allows you to ...


Now that we have chosen our important columns, let's do some data analysis on the textual data in the body and the title