# Plan
To classify more videos into these categories based on their titles using NLP, you can follow these steps:

1. **Data Preprocessing:**
   - Tokenize the video titles: Split each title into individual words or tokens.
   - Lowercasing: Convert all tokens to lowercase to ensure consistency.
   - Remove punctuation: Eliminate any punctuation marks from the tokens.
   - Remove stopwords: Remove common words (e.g., "the", "is", "and") that do not contribute much to the classification.
   - Stemming or Lemmatization (optional): Reduce words to their base or root form to further normalize the text data.

2. **Feature Extraction:**
   - Use techniques like TF-IDF (Term Frequency-Inverse Document Frequency) or Bag of Words to convert the preprocessed text data into numerical feature vectors.
   - TF-IDF assigns weights to words based on their frequency in the document and inverse frequency across all documents, providing a measure of importance for each word.

3. **Model Training and Evaluation:**
   - Split the dataset into training and testing sets.
   - Train a machine learning model (e.g., Naive Bayes, Logistic Regression, Support Vector Machine, Adaboost Classifier, LSTM) using the training data and the extracted features.
   - Evaluate the trained model's performance using the testing data and metrics such as accuracy, precision, recall, and F1-score.
   
4. **Model Deployment:**
   - Once you have a satisfactory model, deploy it to classify new videos into the predefined categories based on their titles.
   - You can use the trained model to predict the classification of new video titles.

5. **Continuous Improvement:**
   - Monitor the model's performance over time and collect feedback.
   - Periodically retrain the model with updated data to improve its accuracy and effectiveness.

By following these steps, you can build an NLP-based classification system to categorize more videos into the predefined categories based on their titles.

# Set Up Environment

## Import Libraries

In [1]:
import pandas as pd

## Define Functions

# Import Data

In [5]:
videos_with_labelling_df = pd.read_csv('videos_with_labelling_df.csv')

In [6]:
videos_with_labelling_df = videos_with_labelling_df.rename(columns={'classification': 'video_type'})

In [7]:
videos_with_labelling_df.head()

Unnamed: 0,channel_id,video_id,video_title,description,tags,published,view_count,like_count,favourite_count,comment_count,duration,definition,caption,category_id,prompt,video_type
0,UC8butISFwT-Wl7EV0hUK0BQ,9He4UBLyk8Y,Front End Developer Roadmap 2024,Learn what technologies you should learn first...,,2023-10-19 14:18:42.000000,507722.0,17091.0,0,493.0,729,hd,False,27,Front End Developer Roadmap 2024,Career
1,UC8butISFwT-Wl7EV0hUK0BQ,ypNKKYUJE5o,JavaScript Security Vulnerabilities Tutorial ...,Learn about 10 security vulnerabilities every ...,,2023-05-16 14:37:07.000000,62016.0,2625.0,0,71.0,1505,hd,True,27,JavaScript Security Vulnerabilities Tutorial –...,Tutorial
2,UC8butISFwT-Wl7EV0hUK0BQ,D6Xj_W4leu8,Use ChatGPT to Build a RegEx Generator – OpenA...,Learn how to build a dashboard that generates ...,,2023-03-30 13:32:31.000000,102762.0,2133.0,0,82.0,1792,hd,True,27,Use ChatGPT to Build a RegEx Generator – OpenA...,Tutorial
3,UC8butISFwT-Wl7EV0hUK0BQ,xZbU6bCZFYo,freeCodeCamp.org Curriculum Expansion: Math + ...,Support our campaign here: https://www.freecod...,,2021-02-02 19:00:57.000000,87027.0,3478.0,0,197.0,1677,hd,True,27,freeCodeCamp.org Curriculum Expansion: Math + ...,News
4,UC8butISFwT-Wl7EV0hUK0BQ,flpmSXVTqBI,Java Testing - JUnit 5 Crash Course,JUnit 5 is one of the most popular frameworks ...,,2021-01-12 15:59:45.000000,309188.0,5393.0,0,97.0,1565,hd,False,27,Java Testing - JUnit 5 Crash Course,Tutorial


In [8]:
video_classification_by_title_df = videos_with_labelling_df[['video_id', 'video_title', 'video_type']].copy()

In [9]:
video_classification_by_title_df.head()

Unnamed: 0,video_id,video_title,video_type
0,9He4UBLyk8Y,Front End Developer Roadmap 2024,Career
1,ypNKKYUJE5o,JavaScript Security Vulnerabilities Tutorial ...,Tutorial
2,D6Xj_W4leu8,Use ChatGPT to Build a RegEx Generator – OpenA...,Tutorial
3,xZbU6bCZFYo,freeCodeCamp.org Curriculum Expansion: Math + ...,News
4,flpmSXVTqBI,Java Testing - JUnit 5 Crash Course,Tutorial


# Data Preprocessing

## Tokenization
- Tokenization is the process of splitting the text into individual words or tokens. You can use a tokenizer to break down the video titles into their constituent words.
- Example: "Front-End Developer's Roadmap 2024: A Comprehensive Guide!" → ["Front-End", "Developer's", "Roadmap", "2024", ":", "A", "Comprehensive", "Guide!"]

### NLTK
NLTK (Natural Language Toolkit) is a powerful library for natural language processing in Python. It offers various tokenizers for different languages and purposes. Let's delve into NLTK's tokenizers and discuss their suitability for the task of tokenizing video titles.

Input Example: "Front-End Developer's Roadmap 2024: A Comprehensive Guide!"

- **Word Tokenization**: 
  Splits the text into words based on whitespace and punctuation, but keeps contractions and hyphenated words intact. It treats the apostrophe and colon as separate tokens.
  Output: ['Front-End', 'Developer', "'s", 'Roadmap', '2024', ':', 'A', 'Comprehensive', 'Guide', '!']

- **WordPunct Tokenization**: 
  Splits the text into words and punctuation marks, treating each punctuation mark as a separate token. Contractions are split into individual tokens, and hyphenated words are split.
  Output: ['Front', '-', 'End', 'Developer', "'", 's', 'Roadmap', '2024', ':', 'A', 'Comprehensive', 'Guide', '!']

- **Regexp Tokenization**: 
  Uses a regular expression pattern (\w+) to match alphanumeric characters and underscores. It splits the text into words and numbers, removing other characters like apostrophes and punctuation marks.
  Output: ['Front', 'End', 'Developer', 's', 'Roadmap', '2024', 'A', 'Comprehensive', 'Guide']

- **Treebank Tokenization**: 
  Follows the conventions of the Penn Treebank corpus. It treats hyphenated words as single tokens and preserves punctuation marks as separate tokens.
  Output: ['Front-End', 'Developer', "'s", 'Roadmap', '2024', ':', 'A', 'Comprehensive', 'Guide', '!']

WordPunct Tokenization may be the best choice for this use case because it preserves punctuation marks, handles contractions and hyphenated words effectively, and provides flexibility in tokenization. Video titles often contain punctuation marks and informal language, making WordPunct Tokenization suitable for maintaining the integrity of the title's structure while extracting meaningful units of text.

## Lowercasing
- Convert all words in the video titles to lowercase. This ensures that words with different capitalization are treated as the same word.
- Example: "Front-End Developer's Roadmap 2024: A Comprehensive Guide!" → "front-end developer's roadmap 2024: a comprehensive guide!"

## Removing Punctuation
- Remove any punctuation marks from the video titles. Punctuation marks such as periods, commas, exclamation marks, etc., are typically not relevant for text classification tasks.
- Example: "Front-End Developer's Roadmap 2024: A Comprehensive Guide!" → "FrontEnd Developers Roadmap 2024 A Comprehensive Guide"

## Removing Stopwords
- Stopwords are common words that do not carry much semantic meaning, such as "and", "the", "is", etc. They are often removed because they can introduce noise into the data.
- You can use a predefined list of stopwords or a library like NLTK (Natural Language Toolkit) to remove stopwords from the video titles.
- Example: "Front-End Developer's Roadmap 2024: A Comprehensive Guide!" → "Front-End Developer's Roadmap 2024: Comprehensive Guide"
- If there are issues with certificate - try this
https://stackoverflow.com/questions/44649449/brew-installation-of-python-3-6-1-ssl-certificate-verify-failed-certificate/44649450#44649450

## Handling Special Characters
- Depending on the nature of your dataset, you may encounter special characters such as emojis, symbols, or non-alphanumeric characters. Decide whether to keep or remove these characters based on your analysis needs.
- Example: "Front-End Developer's Roadmap 2024: A Comprehensive Guide! 😊" → "Front-End Developer's Roadmap 2024: A Comprehensive Guide!"

## Handling Numbers
- Decide how to handle numbers in the video titles. You may choose to keep them as-is, remove them, or replace them with placeholders.
- Example: "Front-End Developer's Roadmap 2024: A Comprehensive Guide!" → "Front-End Developer's Roadmap : A Comprehensive Guide!"

## Stemming and Lemmatization: Choosing the Right Technique

Stemming and lemmatization are essential text normalization techniques that aim to reduce words to their base or root forms. Both methods are used to enhance the efficiency of text processing and improve the performance of natural language processing (NLP) models. However, they operate differently and have distinct advantages and limitations.

### Stemming:

Stemming involves removing prefixes or suffixes from words to derive their root forms, known as stems. The goal is to map different variations of a word to the same base form, thereby reducing the dimensionality of the vocabulary. For example, the word "running" would be stemmed to "run", and "played" would be stemmed to "play". Stemming algorithms apply heuristic rules to chop off affixes, which may not always produce valid words.

### Lemmatization:

Lemmatization, on the other hand, maps words to their base or dictionary forms, known as lemmas, by considering the context and meaning of the word. Unlike stemming, lemmatization ensures that the resulting word is valid and meaningful. For example, the word "ran" would be lemmatized to "run", and "better" would be lemmatized to "good". Lemmatization relies on linguistic knowledge and requires access to a lexical resource such as WordNet to perform accurate transformations.

### Choosing the Right Technique:

The choice between stemming and lemmatization depends on the specific requirements of the NLP task and the characteristics of the dataset. Stemming is faster and less computationally intensive, making it suitable for applications where speed is crucial. However, it may produce non-dictionary words or incorrect stems in certain cases. On the other hand, lemmatization ensures the generation of valid words but is slower and requires more computational resources.

When deciding between stemming and lemmatization, consider the trade-offs between efficiency and accuracy. In many cases, lemmatization is preferred for tasks requiring precise word normalization and semantic analysis, while stemming may suffice for tasks focused on text classification or information retrieval.

Both stemming and lemmatization can be easily implemented using libraries such as NLTK or spaCy, offering flexibility and ease of integration into NLP pipelines. Choose the technique that best aligns with your goals and the characteristics of your dataset to achieve optimal results in your NLP applications.


In [10]:
import pandas as pd
import re
import nltk
from nltk.tokenize import WordPunctTokenizer
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer 

# If there are issues with certificate - try this https://stackoverflow.com/questions/44649449/brew-installation-of-python-3-6-1-ssl-certificate-verify-failed-certificate/44649450#44649450

# Download NLTK resources if not already downloaded
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

def preprocess_titles(df, treatments):
    """
    Preprocesses the video titles in a DataFrame based on the specified treatments.

    Parameters:
    - df (DataFrame): The DataFrame containing the video titles.
    - treatments (list): A list of treatments to apply to the video titles. Possible treatments include:
                         'lowercasing', 'remove_punctuation', 'remove_stopwords',
                         'remove_special_characters', 'remove_numbers', 'stemming', 'lemmatization'.

    Returns:
    - Series: The preprocessed tokenized titles.
    """

    # Tokenization
    tokenizer = WordPunctTokenizer()
    df['tokenized_title'] = df['video_title'].apply(tokenizer.tokenize)

    # Apply specified treatments
    for treatment in treatments:
        if treatment == 'lowercasing':
            df['tokenized_title'] = df['tokenized_title'].apply(lambda x: [word.lower() for word in x])
        elif treatment == 'remove_punctuation':
            df['tokenized_title'] = df['tokenized_title'].apply(lambda x: [word for word in x if re.match(r'^\w+$', word)])
        elif treatment == 'remove_stopwords':
            stop_words = set(stopwords.words('english'))
            df['tokenized_title'] = df['tokenized_title'].apply(lambda x: [word for word in x if word.lower() not in stop_words])
        elif treatment == 'remove_special_characters':
            df['tokenized_title'] = df['tokenized_title'].apply(lambda x: [re.sub(r'[^a-zA-Z0-9\s]', '', word) for word in x])
            df['tokenized_title'] = df['tokenized_title'].apply(lambda x: [word for word in x if word])
        elif treatment == 'remove_numbers':
            df['tokenized_title'] = df['tokenized_title'].apply(lambda x: [re.sub(r'\b\d+\b', '', word) for word in x])
            df['tokenized_title'] = df['tokenized_title'].apply(lambda x: [word for word in x if word])
        elif treatment == 'stemming':
            porter = PorterStemmer()
            df['tokenized_title'] = df['tokenized_title'].apply(lambda x: [porter.stem(word) for word in x])
        elif treatment == 'lemmatization':
            lemmatizer = WordNetLemmatizer()
            df['tokenized_title'] = df['tokenized_title'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])

    return df['tokenized_title']


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/harrynorton/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/harrynorton/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/harrynorton/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/harrynorton/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [11]:
# Specify the treatments to apply
treatments = ['lowercasing',
              'remove_punctuation',
              'remove_stopwords',
              #'remove_special_characters',
              #'remove_numbers',
              #'stemming',
              'lemmatization']

# Apply preprocessing to the DataFrame
video_classification_by_title_df['tokenized_video_title'] = preprocess_titles(video_classification_by_title_df.copy(), treatments)

video_classification_by_title_df

Unnamed: 0,video_id,video_title,video_type,tokenized_video_title
0,9He4UBLyk8Y,Front End Developer Roadmap 2024,Career,"[front, end, developer, roadmap, 2024]"
1,ypNKKYUJE5o,JavaScript Security Vulnerabilities Tutorial ...,Tutorial,"[javascript, security, vulnerability, tutorial..."
2,D6Xj_W4leu8,Use ChatGPT to Build a RegEx Generator – OpenA...,Tutorial,"[use, chatgpt, build, regex, generator, openai..."
3,xZbU6bCZFYo,freeCodeCamp.org Curriculum Expansion: Math + ...,News,"[freecodecamp, org, curriculum, expansion, mat..."
4,flpmSXVTqBI,Java Testing - JUnit 5 Crash Course,Tutorial,"[java, testing, junit, 5, crash, course]"
...,...,...,...,...
31073,QmPBLroyHB0,Neural Network learns the Mandelbrot set [Part 1],,"[neural, network, learns, mandelbrot, set, par..."
31074,RO9rfa8-vwo,Life Engine Update (now with graphs! 📈),,"[life, engine, update, graph]"
31075,HpgXTphPCP0,Bugs are Features in Evolution [The Life Engine],,"[bug, feature, evolution, life, engine]"
31076,uGkkm023BSs,Building a Zoo with Evolution [The Life Engine],,"[building, zoo, evolution, life, engine]"


## Encoding (if necessary)
- Encode the preprocessed text data into a suitable format for further processing or analysis, such as one-hot encoding or word embeddings.

# Data Inspection

In [12]:
video_classification_by_title_df.head()

Unnamed: 0,video_id,video_title,video_type,tokenized_video_title
0,9He4UBLyk8Y,Front End Developer Roadmap 2024,Career,"[front, end, developer, roadmap, 2024]"
1,ypNKKYUJE5o,JavaScript Security Vulnerabilities Tutorial ...,Tutorial,"[javascript, security, vulnerability, tutorial..."
2,D6Xj_W4leu8,Use ChatGPT to Build a RegEx Generator – OpenA...,Tutorial,"[use, chatgpt, build, regex, generator, openai..."
3,xZbU6bCZFYo,freeCodeCamp.org Curriculum Expansion: Math + ...,News,"[freecodecamp, org, curriculum, expansion, mat..."
4,flpmSXVTqBI,Java Testing - JUnit 5 Crash Course,Tutorial,"[java, testing, junit, 5, crash, course]"


In [13]:
video_classification_by_title_df['video_type'].value_counts()

video_type
Tutorial     1629
Career        458
Project       225
Tips          223
Challenge     123
Review        118
News          108
Interview     105
Lecture         7
Debate          4
Name: count, dtype: int64

# Feature Creation

## TF-IDF vectorization

### What is TF-IDF vectorization

TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a numerical statistic used in information retrieval and text mining to evaluate the importance of a word in a document relative to a collection of documents (corpus). TF-IDF is commonly used for text feature extraction in machine learning and natural language processing tasks.

Here's a breakdown of TF-IDF:

1. **Term Frequency (TF)**: It measures how frequently a term (word) occurs in a document. It is calculated as the ratio of the number of times a term appears in a document to the total number of terms in the document. The idea is that words that occur more frequently within a document are more important for describing the content of that document.

   $$ \text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} $$

2. **Inverse Document Frequency (IDF)**: It measures the importance of a term across a collection of documents (corpus). It is calculated as the logarithm of the ratio of the total number of documents to the number of documents containing the term. The IDF value decreases as the term appears in more documents, indicating that common terms are less informative than rare terms.

   $$ \text{IDF}(t, D) = \log\left(\frac{\text{Total number of documents in corpus } |D|}{\text{Number of documents containing term } t}\right) $$

3. **TF-IDF**: It combines the TF and IDF values to calculate a weighted score for each term in a document. The TF-IDF score increases with the frequency of the term in the document (TF) and decreases with the frequency of the term in the corpus (IDF).

   $$ \text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D) $$

In essence, TF-IDF identifies words that are unique and important to a specific document while also considering their general importance across a collection of documents. It's commonly used for tasks like document classification, information retrieval, and text mining.

## sklearn TfidfVectorizer
1. **Initialize TfidfVectorizer with no preprocessing:** 
Here, we initialize a TfidfVectorizer object without specifying any preprocessing steps. By setting `preprocessor=None` and `tokenizer=None`, we indicate that we don't want any preprocessing to be applied by the vectorizer. This means that the input data will be used directly as it is without any modifications.

2. **Fit and transform the tokenized_video_title to TF-IDF vectors:** 
We apply the `fit_transform` method of the TfidfVectorizer to convert the tokenized_video_title into TF-IDF vectors. This step computes the TF-IDF values for each word in the tokenized_video_title and represents each document as a vector in the TF-IDF space.

3. **Get feature names (words):**
After fitting the TfidfVectorizer to the data, we retrieve the feature names, which correspond to the words present in the corpus. These feature names are obtained using the `get_feature_names_out()` method of the TfidfVectorizer.

4. **Get TF-IDF values for each document:**
We convert the TF-IDF matrix obtained from the fit_transform step into a NumPy array using the `toarray()` method. This array contains the TF-IDF values for each word in each document.

5. **Create a DataFrame to store TF-IDF values for each word:**
Finally, we create a DataFrame named tfidf_df to store the TF-IDF values for each word. The DataFrame has columns corresponding to the feature names (words) obtained in step 3, and each row represents a document with its corresponding TF-IDF values for each word.

In [16]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# def dummy_fun(doc):
#     return doc

# tfidf = TfidfVectorizer(
#     analyzer='word',
#     tokenizer=dummy_fun,
#     preprocessor=dummy_fun,
#     token_pattern=None)

# Initialize TfidfVectorizer with no preprocessing
tfidf_vectorizer = TfidfVectorizer(preprocessor=None, tokenizer=None)

# Fit and transform the tokenized_video_title to TF-IDF vectors
tfidf_matrix = tfidf_vectorizer.fit_transform(video_classification_by_title_df['tokenized_video_title'])

# Get feature names (words)
feature_names = tfidf_vectorizer.get_feature_names_out()

# Get TF-IDF values for each document
tfidf_values = tfidf_matrix.toarray()

# Create a DataFrame to store TF-IDF values for each word
tfidf_df = pd.DataFrame(tfidf_values, columns=feature_names)

tfidf_df

AttributeError: 'list' object has no attribute 'lower'

In [17]:
video_classification_by_title_df['tokenized_video_title']

0                   [front, end, developer, roadmap, 2024]
1        [javascript, security, vulnerability, tutorial...
2        [use, chatgpt, build, regex, generator, openai...
3        [freecodecamp, org, curriculum, expansion, mat...
4                 [java, testing, junit, 5, crash, course]
                               ...                        
31073    [neural, network, learns, mandelbrot, set, par...
31074                        [life, engine, update, graph]
31075              [bug, feature, evolution, life, engine]
31076             [building, zoo, evolution, life, engine]
31077                [evolution, eye, brain, life, engine]
Name: tokenized_video_title, Length: 31078, dtype: object

In [119]:
# Find the most important words (top TF-IDF words) overall
most_important_words_overall = tfidf_df.sum().nlargest(10)

# Display the most important words overall
most_important_words_overall

data          126.395587
python        112.624916
tutorial      101.351436
javascript     66.684754
science        59.910169
learn          54.663804
analyst        52.367814
learning       48.864468
beginner       44.943550
minute         42.785388
dtype: float64

In [114]:
# Group the DataFrame by 'classification' and calculate the sum of TF-IDF values for each word
grouped_tfidf = tfidf_df.groupby(video_classification_by_title_df['video_type']).sum()

# Display the most important words for each classification
for video_type, tfidf_scores in grouped_tfidf.iterrows():
    print(f"video_type: {video_type}")
    print(tfidf_scores.nlargest(10))
    print()

Classification: Career
data         51.982620
analyst      36.005160
science      22.835622
job          21.629242
scientist    20.787768
become       20.599251
engineer     14.742559
get          13.353913
developer     9.762521
career        9.468605
Name: Career, dtype: float64

Classification: Challenge
interview     4.275692
daily         3.930452
tried         3.843385
programmer    3.796204
cs            3.660054
mistake       3.215730
coding        3.124480
ai            3.123360
question      3.087153
solving       3.076888
Name: Challenge, dtype: float64

Classification: Debate
underrated    0.712834
neuralseek    0.644472
technology    0.619791
tech          0.505279
skill         0.486375
business      0.451840
better        0.442710
chatgpt       0.429539
threat        0.419043
zuckerberg    0.419043
Name: Debate, dtype: float64

Classification: Interview
interview                 8.167957
question                  5.255439
prof                      3.573410
data          

# Model Creation
The order in which you try different classifiers depends on various factors such as the size and nature of your dataset, the complexity of the classification task, and the computational resources available. Here's a suggested order to try these classifiers:

1. **Logistic Regression:**
   - Logistic Regression is a simple and efficient linear model that serves as a good baseline for classification tasks. It's fast to train and easy to interpret.

2. **Naive Bayes:**
   - Naive Bayes classifiers are probabilistic models based on Bayes' theorem with the assumption of independence between features. They are particularly effective for text classification tasks and work well with small to medium-sized datasets.

3. **Support Vector Machine (SVM):**
   - SVM is a powerful supervised learning algorithm capable of handling linear and nonlinear classification tasks. It's effective in high-dimensional spaces and is known for its robustness and flexibility.

4. **Adaboost Classifier:**
   - Adaboost (Adaptive Boosting) is an ensemble learning method that combines multiple weak classifiers to create a strong classifier. It sequentially corrects the errors of the previous model, making it particularly effective in boosting the performance of other algorithms.

5. **LSTM (Long Short-Term Memory):**
   - LSTM is a type of recurrent neural network (RNN) architecture commonly used for sequence prediction tasks, including natural language processing (NLP). It's capable of capturing long-term dependencies in sequential data, making it suitable for text classification tasks with complex patterns.

Starting with simpler models like Logistic Regression and Naive Bayes allows you to quickly establish a baseline performance and understand the data characteristics. Then, you can gradually explore more complex models like SVM, Adaboost, and LSTM to improve classification accuracy if needed. Additionally, considering the computational complexity of LSTM, it's advisable to try it last, especially if you have limited computational resources.

## Logistic Regression

Make sure to handle any missing values or preprocessing steps before merging and training the model. Additionally, consider performing feature scaling or other data transformations if necessary for better model performance.

To train a Logistic Regression model using the TF-IDF vectors from `tfidf_df` and the `classification` labels from `video_classification_by_title_df`, follow these steps:

1. **Merge DataFrames:**
   Merge `tfidf_df` with `video_classification_by_title_df` on the common index (assuming the index represents the same order of samples in both DataFrames). This will bring together the TF-IDF vectors and the corresponding classification labels.

2. **Split Data:**
   Split the merged DataFrame into features (TF-IDF vectors) and labels (classification). The features will be all columns except the `classification` column, and the labels will be the `classification` column.

3. **Train/Test Split:**
   Split the data into training and testing sets using `train_test_split` from Scikit-learn. This will allow you to train the model on one portion of the data and evaluate its performance on another portion.

4. **Initialize Logistic Regression Model:**
   Initialize a Logistic Regression model using `LogisticRegression` from Scikit-learn.

5. **Train the Model:**
   Train the Logistic Regression model using the training data. This is done by calling the `fit` method on the model object with the features and labels of the training set as arguments.

6. **Evaluate the Model:**
   Evaluate the trained model's performance using the testing data. You can use metrics like accuracy, precision, recall, F1-score, or confusion matrix to assess how well the model performs on unseen data.

7. **Tune Hyperparameters (Optional):**
   Optionally, you can tune the hyperparameters of the Logistic Regression model using techniques like grid search or random search to optimize its performance.

8. **Make Predictions (Optional):**
   If you're satisfied with the model's performance, you can use it to make predictions on new, unseen data.


In [128]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Step 1: Merge DataFrames
merged_df = pd.concat([tfidf_df, video_classification_by_title_df['classification']], axis=1)

# Step 2: Split Data
X = merged_df.drop(columns=['classification'])
y = merged_df['classification']

# Step 3: Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Initialize Logistic Regression Model
logreg_model = LogisticRegression()

# Step 5: Train the Model
logreg_model.fit(X_train, y_train)

# Step 6: Evaluate the Model
accuracy = logreg_model.score(X_test, y_test)
print("Accuracy:", accuracy)

# Step 7: Tune Hyperparameters (Optional)
# (e.g., using GridSearchCV)

# Step 8: Make Predictions (Optional)
# (e.g., logreg_model.predict(X_new))

ValueError: y should be a 1d array, got an array of shape (2400, 2) instead.

In [130]:
merged_df['classification']

Unnamed: 0,classification,classification.1
0,0.0,Career
1,0.0,Tutorial
2,0.0,Tutorial
3,0.0,News
4,0.0,Tutorial
...,...,...
2995,0.0,Career
2996,0.0,Tutorial
2997,0.0,Tutorial
2998,0.0,Tutorial
