# Set Up Environment

## Import Libraries

In [2]:
import pandas as pd

## Define Functions

# Import Data

In [4]:
videos_with_labelling_df = pd.read_csv('videos_with_labelling_df.csv')

In [5]:
videos_with_labelling_df.head()

Unnamed: 0,channel_id,video_id,video_title,description,tags,published,view_count,like_count,favourite_count,comment_count,duration,definition,caption,category_id,prompt,classification
0,UC8butISFwT-Wl7EV0hUK0BQ,9He4UBLyk8Y,Front End Developer Roadmap 2024,Learn what technologies you should learn first...,,2023-10-19 14:18:42.000000,507722.0,17091.0,0,493.0,729,hd,False,27,Front End Developer Roadmap 2024,Career
1,UC8butISFwT-Wl7EV0hUK0BQ,ypNKKYUJE5o,JavaScript Security Vulnerabilities Tutorial ...,Learn about 10 security vulnerabilities every ...,,2023-05-16 14:37:07.000000,62016.0,2625.0,0,71.0,1505,hd,True,27,JavaScript Security Vulnerabilities Tutorial –...,Tutorial
2,UC8butISFwT-Wl7EV0hUK0BQ,D6Xj_W4leu8,Use ChatGPT to Build a RegEx Generator – OpenA...,Learn how to build a dashboard that generates ...,,2023-03-30 13:32:31.000000,102762.0,2133.0,0,82.0,1792,hd,True,27,Use ChatGPT to Build a RegEx Generator – OpenA...,Tutorial
3,UC8butISFwT-Wl7EV0hUK0BQ,xZbU6bCZFYo,freeCodeCamp.org Curriculum Expansion: Math + ...,Support our campaign here: https://www.freecod...,,2021-02-02 19:00:57.000000,87027.0,3478.0,0,197.0,1677,hd,True,27,freeCodeCamp.org Curriculum Expansion: Math + ...,News
4,UC8butISFwT-Wl7EV0hUK0BQ,flpmSXVTqBI,Java Testing - JUnit 5 Crash Course,JUnit 5 is one of the most popular frameworks ...,,2021-01-12 15:59:45.000000,309188.0,5393.0,0,97.0,1565,hd,False,27,Java Testing - JUnit 5 Crash Course,Tutorial


In [94]:
video_classification_by_title_df = videos_with_labelling_df[['video_id', 'video_title', 'classification']].copy()

In [95]:
video_classification_by_title_df.head()

Unnamed: 0,video_id,video_title,classification
0,9He4UBLyk8Y,Front End Developer Roadmap 2024,Career
1,ypNKKYUJE5o,JavaScript Security Vulnerabilities Tutorial ...,Tutorial
2,D6Xj_W4leu8,Use ChatGPT to Build a RegEx Generator – OpenA...,Tutorial
3,xZbU6bCZFYo,freeCodeCamp.org Curriculum Expansion: Math + ...,News
4,flpmSXVTqBI,Java Testing - JUnit 5 Crash Course,Tutorial


# Data Preprocessing

## Tokenization
- Tokenization is the process of splitting the text into individual words or tokens. You can use a tokenizer to break down the video titles into their constituent words.
- Example: "Front-End Developer's Roadmap 2024: A Comprehensive Guide!" → ["Front-End", "Developer's", "Roadmap", "2024", ":", "A", "Comprehensive", "Guide!"]

### NLTK
NLTK (Natural Language Toolkit) is a powerful library for natural language processing in Python. It offers various tokenizers for different languages and purposes. Let's delve into NLTK's tokenizers and discuss their suitability for the task of tokenizing video titles.

Input Example: "Front-End Developer's Roadmap 2024: A Comprehensive Guide!"

- **Word Tokenization**: 
  Splits the text on whitespace and punctuation but keeps hyphenated words intact. Clitics such as the possessive 's are split off as their own tokens, and the colon and exclamation mark also become separate tokens.
  Output: ['Front-End', 'Developer', "'s", 'Roadmap', '2024', ':', 'A', 'Comprehensive', 'Guide', '!']

- **WordPunct Tokenization**: 
  Splits the text into words and punctuation marks, treating each punctuation mark as a separate token. Contractions are split into individual tokens, and hyphenated words are split.
  Output: ['Front', '-', 'End', 'Developer', "'", 's', 'Roadmap', '2024', ':', 'A', 'Comprehensive', 'Guide', '!']

- **Regexp Tokenization**: 
  Uses a regular expression pattern (\w+) to match alphanumeric characters and underscores. It splits the text into words and numbers, removing other characters like apostrophes and punctuation marks.
  Output: ['Front', 'End', 'Developer', 's', 'Roadmap', '2024', 'A', 'Comprehensive', 'Guide']

- **Treebank Tokenization**: 
  Follows the conventions of the Penn Treebank corpus. It treats hyphenated words as single tokens and preserves punctuation marks as separate tokens.
  Output: ['Front-End', 'Developer', "'s", 'Roadmap', '2024', ':', 'A', 'Comprehensive', 'Guide', '!']

WordPunct Tokenization is the choice made for this use case. It splits at every punctuation boundary, so each punctuation mark becomes its own token that a later punctuation-removal step can filter out cleanly. The trade-off is that hyphenated words and possessives are broken into their parts ("Front-End" becomes "Front", "-", "End"), so compounds are no longer kept as single units. Video titles often contain punctuation and informal language, which makes this simple, predictable behavior convenient.
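
The WordPunct and Regexp outputs listed above can be reproduced without NLTK, since both tokenizers are regular-expression based. The sketch below uses the standard library `re` module with the pattern `\w+|[^\w\s]+`, which is the pattern NLTK's `WordPunctTokenizer` is built on. The helper names `wordpunct_tokens` and `regexp_tokens` are made up for this illustration.

```python
import re

TITLE = "Front-End Developer's Roadmap 2024: A Comprehensive Guide!"

def wordpunct_tokens(text):
    # Runs of word characters, or runs of non-word, non-space characters.
    return re.findall(r"\w+|[^\w\s]+", text)

def regexp_tokens(text):
    # Only runs of word characters; punctuation is dropped entirely.
    return re.findall(r"\w+", text)

print(wordpunct_tokens(TITLE))
print(regexp_tokens(TITLE))
```

Note that consecutive punctuation such as `?!` comes out of the WordPunct pattern as a single token, because the pattern matches runs rather than single characters.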

## Lowercasing
- Convert all words in the video titles to lowercase. This ensures that words with different capitalization are treated as the same word.
- Example: "Front-End Developer's Roadmap 2024: A Comprehensive Guide!" → "front-end developer's roadmap 2024: a comprehensive guide!"

## Removing Punctuation
- Remove any punctuation marks from the video titles. Punctuation marks such as periods, commas, exclamation marks, etc., are typically not relevant for text classification tasks.
- Example: "Front-End Developer's Roadmap 2024: A Comprehensive Guide!" → "FrontEnd Developers Roadmap 2024 A Comprehensive Guide"
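
Punctuation can be removed from the raw string or from a token list, and the two approaches treat hyphens and apostrophes differently. The following is a minimal sketch using only the standard library. The variable names are illustrative, and the token-level filter mirrors the `^\w+$` check used in the preprocessing function later in this notebook.

```python
import re
import string

title = "Front-End Developer's Roadmap 2024: A Comprehensive Guide!"

# String level: delete every ASCII punctuation character, which fuses
# "Front-End" into "FrontEnd" and "Developer's" into "Developers".
no_punct = title.translate(str.maketrans("", "", string.punctuation))

# Token level: split WordPunct-style first, then keep only tokens made
# entirely of word characters.
tokens = re.findall(r"\w+|[^\w\s]+", title)
word_tokens = [tok for tok in tokens if re.fullmatch(r"\w+", tok)]

print(no_punct)
print(word_tokens)
```

The token-level variant leaves a stray "s" from the possessive, which a stopword filter would typically remove afterwards.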

## Removing Stopwords
- Stopwords are common words that do not carry much semantic meaning, such as "and", "the", "is", etc. They are often removed because they can introduce noise into the data.
- You can use a predefined list of stopwords or a library like NLTK (Natural Language Toolkit) to remove stopwords from the video titles.
- Example: "Front-End Developer's Roadmap 2024: A Comprehensive Guide!" → "Front-End Developer's Roadmap 2024: Comprehensive Guide!"
- If downloading the NLTK resources fails with an SSL certificate error, try the fix described in this Stack Overflow answer:
  https://stackoverflow.com/questions/44649449/brew-installation-of-python-3-6-1-ssl-certificate-verify-failed-certificate/44649450#44649450
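
Stopword filtering usually comes down to a case-insensitive membership test against a set. The sketch below uses a tiny hand-written stopword set as a stand-in (an assumption for illustration; NLTK's English list is much longer) so that it runs without any downloads. The function name `remove_stopwords` is hypothetical.

```python
# Hypothetical mini stopword list; NLTK's stopwords.words('english') is far larger.
STOP_WORDS = {"a", "an", "and", "the", "is", "to", "in", "of"}

def remove_stopwords(text, stop_words=STOP_WORDS):
    # Split on whitespace and compare lowercased tokens, since stopword
    # lists are stored in lowercase.
    kept = [tok for tok in text.split() if tok.lower() not in stop_words]
    return " ".join(kept)

print(remove_stopwords("Front-End Developer's Roadmap 2024: A Comprehensive Guide!"))
```

Lowercasing only for the comparison keeps the original capitalization of the surviving words, which matters if lowercasing is not one of the chosen treatments.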

## Handling Special Characters
- Depending on the nature of your dataset, you may encounter special characters such as emojis, symbols, or non-alphanumeric characters. Decide whether to keep or remove these characters based on your analysis needs.
- Example: "Front-End Developer's Roadmap 2024: A Comprehensive Guide! 😊" → "Front-End Developer's Roadmap 2024: A Comprehensive Guide!"
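
One possible way to drop emoji while keeping ordinary punctuation and non-English letters is to filter on Unicode character categories. This is a rough sketch, assuming that removing the "So" (Symbol, other) category is acceptable for the dataset; it covers most single-codepoint emoji but not every emoji sequence. The function name `strip_symbols` is made up for this example.

```python
import unicodedata

def strip_symbols(text):
    # Drop characters in Unicode category "So" (Symbol, other), which
    # covers most emoji, then trim leftover whitespace.
    kept = "".join(ch for ch in text if unicodedata.category(ch) != "So")
    return kept.strip()

print(strip_symbols("Front-End Developer's Roadmap 2024: A Comprehensive Guide! 😊"))
```

A blunter alternative is to delete everything outside ASCII, but that would also remove legitimate non-English words from the titles.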

## Handling Numbers
- Decide how to handle numbers in the video titles. You may choose to keep them as-is, remove them, or replace them with placeholders.
- Example: "Front-End Developer's Roadmap 2024: A Comprehensive Guide!" → "Front-End Developer's Roadmap : A Comprehensive Guide!"
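
The three options for numbers (keep, remove, replace) can be sketched at the token level as follows. The function `handle_numbers`, its `mode` parameter, and the `<NUM>` placeholder are all hypothetical names chosen for this illustration.

```python
import re

def handle_numbers(tokens, mode="keep", placeholder="<NUM>"):
    # mode: "keep" leaves tokens untouched, "remove" drops all-digit tokens,
    # "replace" swaps them for a placeholder token.
    if mode == "keep":
        return list(tokens)
    if mode == "remove":
        return [tok for tok in tokens if not re.fullmatch(r"\d+", tok)]
    if mode == "replace":
        return [placeholder if re.fullmatch(r"\d+", tok) else tok for tok in tokens]
    raise ValueError(f"unknown mode: {mode!r}")

tokens = ["front", "end", "developer", "roadmap", "2024"]
print(handle_numbers(tokens, "remove"))
print(handle_numbers(tokens, "replace"))
```

Only tokens made entirely of digits are affected, so mixed tokens such as "es6" are left alone.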

## Stemming and Lemmatization: Choosing the Right Technique

Stemming and lemmatization are essential text normalization techniques that aim to reduce words to their base or root forms. Both methods are used to enhance the efficiency of text processing and improve the performance of natural language processing (NLP) models. However, they operate differently and have distinct advantages and limitations.

### Stemming:

Stemming involves removing prefixes or suffixes from words to derive their root forms, known as stems. The goal is to map different variations of a word to the same base form, thereby reducing the dimensionality of the vocabulary. For example, the word "running" would be stemmed to "run", and "played" would be stemmed to "play". Stemming algorithms apply heuristic rules to chop off affixes, which may not always produce valid words.

### Lemmatization:

Lemmatization, on the other hand, maps words to their base or dictionary forms, known as lemmas, using a vocabulary and the word's part of speech. Unlike stemming, lemmatization ensures that the resulting word is valid and meaningful. For example, the verb "ran" would be lemmatized to "run", and the adjective "better" would be lemmatized to "good". Lemmatization relies on linguistic knowledge and requires access to a lexical resource such as WordNet. Note that NLTK's `WordNetLemmatizer` treats every word as a noun unless a part-of-speech tag is passed, so both examples only work with the matching tag (`pos='v'` and `pos='a'`).

### Choosing the Right Technique:

The choice between stemming and lemmatization depends on the specific requirements of the NLP task and the characteristics of the dataset. Stemming is faster and less computationally intensive, making it suitable for applications where speed is crucial. However, it may produce non-dictionary words or incorrect stems in certain cases. On the other hand, lemmatization ensures the generation of valid words but is slower and requires more computational resources.

When deciding between stemming and lemmatization, consider the trade-offs between efficiency and accuracy. In many cases, lemmatization is preferred for tasks requiring precise word normalization and semantic analysis, while stemming may suffice for tasks focused on text classification or information retrieval.

Both stemming and lemmatization can be easily implemented using libraries such as NLTK or spaCy, offering flexibility and ease of integration into NLP pipelines. Choose the technique that best aligns with your goals and the characteristics of your dataset to achieve optimal results in your NLP applications.


In [96]:
import pandas as pd
import re
import nltk
from nltk.tokenize import WordPunctTokenizer
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer 
# If there are issues with certificate - try this https://stackoverflow.com/questions/44649449/brew-installation-of-python-3-6-1-ssl-certificate-verify-failed-certificate/44649450#44649450

# Download NLTK resources if not already downloaded
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

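def preprocess_titles(df, treatments):
    """Tokenize df['video_title'] and apply the named treatments in the order given.

    Adds a 'tokenized_title' column to df in place and returns it as a Series.
    """
    # Tokenization
    tokenizer = WordPunctTokenizer()
    df['tokenized_title'] = df['video_title'].apply(tokenizer.tokenize)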

    # Apply specified treatments
    for treatment in treatments:
        if treatment == 'lowercasing':
            df['tokenized_title'] = df['tokenized_title'].apply(lambda x: [word.lower() for word in x])
        elif treatment == 'remove_punctuation':
            df['tokenized_title'] = df['tokenized_title'].apply(lambda x: [word for word in x if re.match(r'^\w+$', word)])
        elif treatment == 'remove_stopwords':
            stop_words = set(stopwords.words('english'))
            df['tokenized_title'] = df['tokenized_title'].apply(lambda x: [word for word in x if word.lower() not in stop_words])
        elif treatment == 'remove_special_characters':
            df['tokenized_title'] = df['tokenized_title'].apply(lambda x: [re.sub(r'[^a-zA-Z0-9\s]', '', word) for word in x])
            df['tokenized_title'] = df['tokenized_title'].apply(lambda x: [word for word in x if word])
        elif treatment == 'remove_numbers':
            df['tokenized_title'] = df['tokenized_title'].apply(lambda x: [re.sub(r'\b\d+\b', '', word) for word in x])
            df['tokenized_title'] = df['tokenized_title'].apply(lambda x: [word for word in x if word])
        elif treatment == 'stemming':
            porter = PorterStemmer()
            df['tokenized_title'] = df['tokenized_title'].apply(lambda x: [porter.stem(word) for word in x])
        elif treatment == 'lemmatization':
            # No part-of-speech tag is passed, so WordNet treats every token as a noun
            lemmatizer = WordNetLemmatizer()
            df['tokenized_title'] = df['tokenized_title'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])

    return df['tokenized_title']

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/harrynorton/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/harrynorton/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/harrynorton/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/harrynorton/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [97]:
# Specify the treatments to apply
treatments = ['lowercasing',
              'remove_punctuation',
              'remove_stopwords',
              #'remove_special_characters',
              #'remove_numbers',
              #'stemming',
              'lemmatization']

# Apply preprocessing to the DataFrame
video_classification_by_title_df['tokenized_video_title'] = preprocess_titles(video_classification_by_title_df.copy(), treatments)

video_classification_by_title_df

Unnamed: 0,video_id,video_title,classification,tokenized_video_title
0,9He4UBLyk8Y,Front End Developer Roadmap 2024,Career,"[front, end, developer, roadmap, 2024]"
1,ypNKKYUJE5o,JavaScript Security Vulnerabilities Tutorial ...,Tutorial,"[javascript, security, vulnerability, tutorial..."
2,D6Xj_W4leu8,Use ChatGPT to Build a RegEx Generator – OpenA...,Tutorial,"[use, chatgpt, build, regex, generator, openai..."
3,xZbU6bCZFYo,freeCodeCamp.org Curriculum Expansion: Math + ...,News,"[freecodecamp, org, curriculum, expansion, mat..."
4,flpmSXVTqBI,Java Testing - JUnit 5 Crash Course,Tutorial,"[java, testing, junit, 5, crash, course]"
...,...,...,...,...
2995,xtges88iZYU,Simplilearn Reviews | Career Restart After Eig...,Career,"[simplilearn, review, career, restart, eight, ..."
2996,gxKzKfWcNww,Top 10 Programming Languages And 10 Highest Pa...,Tutorial,"[top, 10, programming, language, 10, highest, ..."
2997,oB6TC529Oc0,Top 10 Technologies And 10 Highest Paying Jobs...,Tutorial,"[top, 10, technology, 10, highest, paying, job..."
2998,hVWrfVlomac,Learn Data Classes In Python In 10 Minutes | H...,Tutorial,"[learn, data, class, python, 10, minute, use, ..."


## Encoding (if necessary)
- Encode the preprocessed text data into a suitable format for further processing or analysis, such as one-hot encoding or word embeddings.
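
As a minimal illustration of the simplest encoding, each title can be turned into a binary vector over the vocabulary, with a 1 for every token that occurs in it (one-hot per token, combined per title). The sketch below uses plain Python and two made-up token lists; `build_vocab` and `multi_hot` are hypothetical helper names.

```python
def build_vocab(token_lists):
    # Map each distinct token to a column index, in sorted order.
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: i for i, tok in enumerate(vocab)}

def multi_hot(tokens, vocab):
    # 1 where the token occurs in the title, 0 elsewhere.
    row = [0] * len(vocab)
    for tok in tokens:
        if tok in vocab:
            row[vocab[tok]] = 1
    return row

titles = [["java", "testing", "crash", "course"], ["python", "crash", "course"]]
vocab = build_vocab(titles)
vectors = [multi_hot(t, vocab) for t in titles]
print(vocab)
print(vectors)
```

This representation ignores word order and frequency; weighting schemes such as TF-IDF refine it by scoring each term instead of using 0 or 1.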

# Data Analysis and Feature Exploration

In [98]:
video_classification_by_title_df.head()

Unnamed: 0,video_id,video_title,classification,tokenized_video_title
0,9He4UBLyk8Y,Front End Developer Roadmap 2024,Career,"[front, end, developer, roadmap, 2024]"
1,ypNKKYUJE5o,JavaScript Security Vulnerabilities Tutorial ...,Tutorial,"[javascript, security, vulnerability, tutorial..."
2,D6Xj_W4leu8,Use ChatGPT to Build a RegEx Generator – OpenA...,Tutorial,"[use, chatgpt, build, regex, generator, openai..."
3,xZbU6bCZFYo,freeCodeCamp.org Curriculum Expansion: Math + ...,News,"[freecodecamp, org, curriculum, expansion, mat..."
4,flpmSXVTqBI,Java Testing - JUnit 5 Crash Course,Tutorial,"[java, testing, junit, 5, crash, course]"


In [99]:
video_classification_by_title_df['classification'].value_counts()

classification
Tutorial     1629
Career        458
Project       225
Tips          223
Challenge     123
Review        118
News          108
Interview     105
Lecture         7
Debate          4
Name: count, dtype: int64

## TF-IDF vectorization

TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a numerical statistic used in information retrieval and text mining to evaluate the importance of a word in a document relative to a collection of documents (corpus). TF-IDF is commonly used for text feature extraction in machine learning and natural language processing tasks.

Here's a breakdown of TF-IDF:

1. **Term Frequency (TF)**: It measures how frequently a term (word) occurs in a document. It is calculated as the ratio of the number of times a term appears in a document to the total number of terms in the document. The idea is that words that occur more frequently within a document are more important for describing the content of that document.

   $$ \text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} $$

2. **Inverse Document Frequency (IDF)**: It measures the importance of a term across a collection of documents (corpus). It is calculated as the logarithm of the ratio of the total number of documents to the number of documents containing the term. The IDF value decreases as the term appears in more documents, indicating that common terms are less informative than rare terms.

   $$ \text{IDF}(t, D) = \log\left(\frac{\text{Total number of documents in corpus } |D|}{\text{Number of documents containing term } t}\right) $$

3. **TF-IDF**: It combines the TF and IDF values to calculate a weighted score for each term in a document. The TF-IDF score increases with the frequency of the term in the document (higher TF) and decreases as the term appears in more documents of the corpus (lower IDF).

   $$ \text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D) $$

In essence, TF-IDF identifies words that are unique and important to a specific document while also considering their general importance across a collection of documents. It's commonly used for tasks like document classification, information retrieval, and text mining.
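
The three formulas above can be checked by hand on a tiny corpus. The following sketch implements them literally in plain Python, using the natural logarithm; the three-document corpus and the function names are made up for illustration.

```python
import math

def tf(term, doc):
    # Share of the document's tokens that are `term`.
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Natural log of (number of documents / documents containing the term).
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

corpus = [
    ["python", "tutorial"],
    ["python", "career"],
    ["data", "career", "roadmap"],
]

# "python" is in 2 of 3 documents, "tutorial" in only 1, so "tutorial"
# scores higher within the first document.
print(tf_idf("python", corpus[0], corpus))
print(tf_idf("tutorial", corpus[0], corpus))
```

scikit-learn's `TfidfVectorizer` will not reproduce these exact numbers with its default settings: it uses raw term counts for TF, a smoothed IDF of the form `ln((1 + n) / (1 + df)) + 1`, and L2-normalizes each row.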


In [100]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Drop any NaN values in the 'video_title' column
video_classification_by_title_df = video_classification_by_title_df.dropna(subset=['video_title'])

# Create a TfidfVectorizer instance
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the 'video_title' column
tfidf_matrix = tfidf_vectorizer.fit_transform(video_classification_by_title_df['video_title'])

# Convert the TF-IDF matrix to a DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# Display the TF-IDF DataFrame
tfidf_df.head()

Unnamed: 0,000,01,02,026,03,04,05,06,07,08,...,zero,zhou,zip,zod,zone,zuckerberg,करत,ᵐᵒˢᵗˡʸ,𝐂𝐎𝐃𝐄,𝐓𝐇𝐎𝐍
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [101]:
# Sum the TF-IDF scores across all documents
tfidf_sum = tfidf_df.sum()

# Sort the sums in descending order to get the most important words
most_important_words = tfidf_sum.sort_values(ascending=False)

# Display the most important words
most_important_words.head(20)


in            113.614793
data          113.386482
to            105.033789
python        103.521670
tutorial       88.412563
how            79.583534
for            72.793060
and            70.296413
the            63.687306
javascript     61.449447
with           58.369572
science        54.250774
is             53.561014
learn          48.970938
what           47.112559
of             46.376878
learning       44.818593
analyst        41.672244
you            41.642363
minutes        40.040516
dtype: float64

In [102]:
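# Group tfidf_df by classification and sum the TF-IDF scores for each category.
# tfidf_df has a fresh 0..n-1 index, so take the labels from the DataFrame that was
# vectorized and reset their index to keep rows and labels aligned.
labels = video_classification_by_title_df['classification'].reset_index(drop=True)
tfidf_sum_by_category = tfidf_df.groupby(labels).sum()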

# Display the top 10 most important words for each category
for category, scores in tfidf_sum_by_category.iterrows():
    print(f"Classification: {category}")
    top_10_words = scores.nlargest(10)  # Get the top 10 words with the highest total TF-IDF scores
    print(top_10_words)
    print()


Classification: Career
data         44.813518
to           31.295703
analyst      29.717354
how          24.190321
science      20.192632
become       17.782906
in           17.721425
scientist    16.833936
job          16.104703
the          16.064669
Name: Career, dtype: float64

Classification: Challenge
you           7.158587
this          5.098004
the           4.509462
will          4.088448
programmer    4.083853
are           3.933996
daily         3.781269
tried         3.702819
with          3.128325
css           3.027359
Name: Challenge, dtype: float64

Classification: Debate
neuralseek    0.573375
technology    0.565979
is            0.542158
vs            0.516848
underrated    0.506866
skill         0.488427
the           0.463013
most          0.408131
business      0.404929
threat        0.402841
Name: Debate, dtype: float64

Classification: Interview
interview                 5.755735
on                        5.090825
questions                 3.727423
prof          