# Using Naive Bayes to classify recommended TikTok videos

Goal: classify each video as Beauty, Sports, Society, or Other.

In [1]:
import pandas as pd

In [2]:
pd.set_option('display.max_colwidth', None)

For videos with a Pyktok-provided diversification label, assign each video a corresponding category (sports, beauty, society, or other):

In [7]:
pyk_df = pd.read_csv('pyktok_output.csv')
def determine_category(info): #method written by Johanna
    #non-string values (e.g. NaN) mean the video has no diversification label
    if not isinstance(info, str):
        return None
    label = info.lower()
    if 'sports' in label or 'fitness' in label or 'outdoor' in label:
        return 'sports'
    elif 'beauty' in label or 'style' in label:
        return 'beauty'
    elif 'society' in label or 'news' in label or 'social issues' in label:
        return 'society'
    else:
        return 'other'

#label category
pyk_df['category'] = pyk_df['diversificationLabels'].apply(determine_category)
pyk_df['video_id'] = pyk_df['video_id'].astype(str)
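
The keyword rules above can also be written as a lookup table, which makes the match order explicit. This is only a sketch: `CATEGORY_KEYWORDS`, `categorize`, and the sample label strings are hypothetical names and values chosen for illustration, not actual Pyktok output.

```python
# Keyword table mirroring the rules above; order matters because the first match wins.
CATEGORY_KEYWORDS = [
    ('sports', ('sports', 'fitness', 'outdoor')),
    ('beauty', ('beauty', 'style')),
    ('society', ('society', 'news', 'social issues')),
]

def categorize(info):
    # Non-string values (e.g. NaN) are treated as "no label"
    if not isinstance(info, str):
        return None
    text = info.lower()
    for category, keywords in CATEGORY_KEYWORDS:
        if any(keyword in text for keyword in keywords):
            return category
    return 'other'

# Hypothetical label strings, for illustration only
sample_labels = ['Beauty & Style', 'Fitness & Health', 'Comedy', float('nan')]
sample_categories = [categorize(label) for label in sample_labels]
print(sample_categories)
```

Because the rules use substring matching, a label such as 'Lifestyle' would also match 'style' and be assigned to beauty, which is worth keeping in mind when reading the category counts.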

Clean and prepare the dataset:

In [8]:
#remove rows that have neither suggested_words nor video_description
pyk_df.dropna(subset=['suggested_words', 'video_description'], how='all', inplace=True)

#merge the suggested_words and video_description columns
pyk_df['description'] = pyk_df['suggested_words'].combine_first(pyk_df['video_description'])

#lowercase and remove punctuation
pyk_df['description'] = pyk_df['description'].str.lower()
#regex=True makes pandas treat the pattern as a regular expression rather than a literal string
pyk_df['description'] = pyk_df['description'].str.replace(r'[^\w\s]', '', regex=True)
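
To check what the cleaning step does to a description, here is a small sketch on two made-up strings (the `toy` series is hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical descriptions, for illustration only
toy = pd.Series(['New #makeup look!!', 'Game-day highlights, part 2'])

# Lowercase, then drop every character that is neither a word character nor whitespace
cleaned = toy.str.lower().str.replace(r'[^\w\s]', '', regex=True)
print(cleaned.tolist())
```

Hashtags lose their `#` and hyphenated words are fused ('game-day' becomes 'gameday'), so the vocabulary the classifier sees is slightly different from the raw text.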


Let's see how many videos of each category we have:

In [9]:
category_counts = pyk_df['category'].value_counts(dropna=False)
category_counts

other      1135
beauty      398
sports      196
society     193
None        126
Name: category, dtype: int64

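As we can see above, the Pyktok data has an unequal number of videos in each category: 'other' and 'beauty' are much larger than 'sports' and 'society'. For training our classifier, we want a balanced dataset. So let's keep 200 random videos from each of the two larger categories, which roughly matches the 196 sports and 193 society videos: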

In [10]:
# Filter rows corresponding to the 'beauty' and 'other' categories
beauty_df = pyk_df[pyk_df['category'] == 'beauty']
other_df = pyk_df[pyk_df['category'] == 'other']

# Downsample each to 200 random videos
beauty_df = beauty_df.sample(n=200, random_state=1)
other_df = other_df.sample(n=200, random_state=1)

# Keep every remaining row: 'sports', 'society', and unlabeled videos
remaining_df = pyk_df[(pyk_df['category'] != 'beauty') & (pyk_df['category'] != 'other')]

# Concatenate the downsampled and remaining dataframes
balanced_df = pd.concat([beauty_df, other_df, remaining_df])
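
The same balancing idea can be expressed without naming each category. The following is a sketch on toy data; `cap_per_category` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy data: 5 'other', 3 'beauty', 2 'sports'
toy_df = pd.DataFrame({
    'category': ['other'] * 5 + ['beauty'] * 3 + ['sports'] * 2,
    'description': [f'video {i}' for i in range(10)],
})

def cap_per_category(df, cap, seed=1):
    """Randomly keep at most `cap` rows per category."""
    parts = [
        group.sample(n=min(len(group), cap), random_state=seed)
        for _, group in df.groupby('category')
    ]
    return pd.concat(parts)

capped = cap_per_category(toy_df, cap=2)
capped_counts = capped['category'].value_counts().to_dict()
print(capped_counts)
```

One caveat: `groupby` skips rows whose category is missing by default, so with this approach the unlabeled videos would have to be set aside separately rather than carried along as in the cell above.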

Next we'll tokenize the descriptions:

In [11]:
import nltk

In [12]:
balanced_df['description'] = balanced_df['description'].apply(nltk.word_tokenize)

For training our classifier, we only want the labeled data (videos with Pyktok diversification labels):

In [13]:
labeled_df = balanced_df[balanced_df['category'].notna()]

Use CountVectorizer to turn each description into a vector of word counts:

In [14]:
import sklearn
from sklearn.feature_extraction.text import CountVectorizer

# Work on an explicit copy so the assignment below doesn't write to a slice of balanced_df
labeled_df = labeled_df.copy()

# CountVectorizer expects strings, so join each token list back into a space-separated string
labeled_df['description'] = labeled_df['description'].apply(' '.join)

count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(labeled_df['description'])


Use TF-IDF as model features instead of word counts:

In [15]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer().fit(counts)

counts = tfidf_transformer.transform(counts)

Split the data into training and testing sets:

In [16]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(counts, labeled_df['category'], 
                                                    test_size=0.2, random_state=1)

Now we can fit a Naive Bayes classifier to the training data. We use Multinomial Naive Bayes here, a variant commonly used for text classification with word-count or TF-IDF features.

In [17]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB().fit(X_train, y_train)
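
In the cells above, the vectorizer and the TF-IDF weights are fit on all labeled rows before the train/test split, so the test descriptions influence the vocabulary and document frequencies. One possible alternative, sketched here on made-up data, is to wrap the three steps in a scikit-learn `Pipeline` so that everything is fit on training text only. The texts, labels, and the name `text_clf` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical training data, for illustration only
train_texts = ['makeup tutorial', 'lipstick haul', 'soccer goal', 'marathon training']
train_labels = ['beauty', 'beauty', 'sports', 'sports']

# Counts -> TF-IDF -> Naive Bayes, all fit in one call on the training text
text_clf = Pipeline([
    ('counts', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('nb', MultinomialNB()),
])
text_clf.fit(train_texts, train_labels)

# Raw strings go in; the pipeline applies the fitted transforms before predicting
toy_prediction = text_clf.predict(['new makeup haul'])
print(toy_prediction)
```

A pipeline like this can also be passed directly to `cross_val_score` with raw text, in which case the vectorizer is refit inside each fold.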

We'll use 10-fold cross-validation on the training set to check the accuracy of the model on different subsections of the data:

In [18]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator=model, X=X_train, y=y_train, cv=10)

In [19]:
print(accuracies)
print("Average accuracy:", sum(accuracies)/len(accuracies))

[0.609375   0.76190476 0.76190476 0.79365079 0.77777778 0.71428571
 0.63492063 0.63492063 0.66666667 0.6984127 ]
Average accuracy: 0.7053819444444445


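Across the 10 folds, accuracy ranges from about 0.61 to 0.79, with a mean of about 0.71. That isn't spectacular, but it is well above the roughly 0.25 expected from random guessing among four nearly balanced classes, so our classifier can serve as a reasonable estimator of each video's category.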
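
The notebook holds out `X_test` and `y_test` but does not score the model on them. A held-out check could look like the sketch below, where the hypothetical lists `y_true` and `y_pred` stand in for `y_test` and `model.predict(X_test)`. The confusion matrix shows which categories get mistaken for which.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions, for illustration only
labels = ['beauty', 'other', 'society', 'sports']
y_true = ['beauty', 'beauty', 'other', 'other', 'society', 'sports']
y_pred = ['beauty', 'other', 'other', 'other', 'other', 'sports']

toy_accuracy = accuracy_score(y_true, y_pred)

# Rows are true categories, columns are predicted categories, both in `labels` order
toy_matrix = confusion_matrix(y_true, y_pred, labels=labels)
print(toy_accuracy)
print(toy_matrix)
```

In this toy case every error lands in the 'other' column, the kind of pattern that a single accuracy number would hide.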

### So, let's use our newly trained Naive Bayes classifier to categorize unlabeled videos:

In [20]:
# Extract rows with NaN 'category' values
unlabeled_df = pyk_df[pyk_df['category'].isna()]
unlabeled_df.shape

(126, 23)

In [21]:
#Transform the unlabeled data into counts and then TF-IDF features:
X_unlabeled_counts = count_vectorizer.transform(unlabeled_df['description']) 
X_unlabeled_tfidf = tfidf_transformer.transform(X_unlabeled_counts)
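
Note that the cell above calls `transform`, not `fit_transform`, so the unlabeled descriptions are mapped onto the vocabulary learned from the labeled data. A small sketch with made-up text shows the consequence: words the vectorizer has never seen are silently dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Fit on a hypothetical "labeled" corpus; the vocabulary is fixed from here on
fitted_vectorizer = CountVectorizer().fit(['makeup tutorial', 'soccer goal'])

# 'skincare' was never seen during fitting, so it gets no column
new_counts = fitted_vectorizer.transform(['skincare makeup'])
print(new_counts.shape)
print(new_counts.toarray())
```

The output keeps the same four columns as the fitted vocabulary, which is what lets the trained model accept the new rows.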

In [22]:
#Predict categories and add to our Pyktok data
predicted_categories = model.predict(X_unlabeled_tfidf)

#Write the predictions straight into pyk_df by index, rather than assigning to the unlabeled_df slice
pyk_df.loc[unlabeled_df.index, 'category'] = predicted_categories

updated_pyk_df = pyk_df.dropna(subset=['category'])
updated_pyk_df.to_csv('categorized_pyktok.csv')
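
Since cross-validated accuracy is around 0.71, some of these predicted labels will be wrong. One way to flag the shakier ones is to look at the class probabilities from `predict_proba`. The sketch below uses a tiny made-up corpus and hypothetical names (`toy_vec`, `toy_model`, `confidence`); it is not the notebook's trained model.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training data, for illustration only
toy_vec = CountVectorizer()
toy_X = toy_vec.fit_transform(
    ['makeup tutorial', 'lipstick haul', 'soccer goal', 'marathon training']
)
toy_model = MultinomialNB().fit(toy_X, ['beauty', 'beauty', 'sports', 'sports'])

# One clear-cut description and one that mixes words from both classes
new_texts = ['makeup haul', 'soccer makeup']
probabilities = toy_model.predict_proba(toy_vec.transform(new_texts))

# Keep the top class and its probability next to each description
confidence = pd.DataFrame({
    'description': new_texts,
    'predicted': toy_model.classes_[probabilities.argmax(axis=1)],
    'confidence': probabilities.max(axis=1),
})
print(confidence)
```

The mixed description gets a much lower top probability than the clear-cut one. Naive Bayes probabilities tend to be poorly calibrated, so they are better used for ranking predictions than as literal confidence levels.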
