# Problem description / Topic Modelling

We will use an unsupervised machine learning technique, **Topic Modelling**, to categorize Customer Support help center articles. The problem we are trying to solve is that searching a few hundred help center articles is a better user experience when you can use tags to find the articles that matter to you. The resulting topics will also be used for search.

# Step 1 - Gather data


I'll create a function that downloads the article text from a specific help center hosted on Zendesk (a SaaS product). We first check whether the data already exists at `data/articles.csv`; if it does, we skip the download.

Since each help center belongs to one specific business, we will process only one company's help center rather than many. This should make our Topic Modelling model(s) more accurate for that business.

In [None]:
import requests
import pandas as pd
import os

def fetch_articles():
    url = "https://support.amboss.com/api/v2/help_center/en-us/articles?per_page=100"
    all_articles = []

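    while url:
        # Fail fast on network stalls and HTTP errors instead of parsing a bad response.
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        data = response.json()
        all_articles.extend(data['articles'])
        # 'next_page' holds the URL of the next page, or None on the last page.
        url = data.get('next_page')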

    return all_articles

def save_articles_to_csv(articles, filename):
    df = pd.DataFrame(articles)
    df.to_csv(filename, index=False)

file_path = 'data/articles.csv'

if not os.path.exists(file_path):
    os.makedirs(os.path.dirname(file_path), exist_ok=True)
    articles = fetch_articles()
    save_articles_to_csv(articles, file_path)
    print(f"Articles saved to {file_path}")
else:
    print(f"File {file_path} already exists. Skipping data collection.")


# Step 2 - Identify an Unsupervised Learning Problem

To categorize / label a relatively large number (100+) of customer support help center articles, we will use Topic Modelling (most likely Latent Dirichlet Allocation) as the unsupervised learning algorithm, because the existing articles have no labels. In addition, we are an external team, so we don't have the domain knowledge needed to label the data manually. Some other reasons for choosing this approach:
* It can discover latent features within a collection of documents.
* Topic Modelling works well with unstructured text.
* It can handle new and updated help center articles well. Since the information in a support help center changes often, we want a scalable solution.
* It is also flexible, and will be able to find latent features in any new information that is added to the help center articles.
* It can condense the high dimensionality of words into a lower-dimensional set of topics, which makes it easier to analyze the overall themes.

Using a supervised machine learning algorithm is not ideal because we don't have labelled data and we want this solution to be scalable. We want to be able to handle new articles without labelling them by hand. A supervised classifier can only predict the labels it was trained on, so it would miss new topics or themes in the support help center until we labelled fresh examples and retrained it.

# Step 3 - EDA
