# Text Study

After the initial audio study and the necessary text extraction, the next logical step will be the stduy of this type of unstructured data.

We will take a look at both the audio transcripts and the captions from the videos, in order to extract as much information as possible to predict the virality of a video.

## First libraries and variables import

In [1]:
import sys
from pathlib import Path
import os

# Get the absolute path of the folder containing the module
root_dir = Path.cwd().resolve().parent.parent

# Add the folder path to sys.path
sys.path.append(str(root_dir))

import nltk
from nltk.tokenize import word_tokenize

In [2]:
from text_utils import load_json, load_transcriptions, clean_text, plot_freq_dist, analyze_text, get_bag_of_words, perform_topic_modeling, classify_sentiment, count_sentiments
from config.variables import text_path, json_file, csv_file

In [3]:
text_folder = os.path.join(root_dir, text_path)
json_path = os.path.join(root_dir, json_file)

## Data recopilation and ordering

In [4]:
# Load the transcriptions of the videos
transcriptions = load_transcriptions(text_folder)

# Load the information from the JSON file
video_info = load_json(json_path)

In [5]:
# Create an object to store the combined information
video_data = {}

# Combine the information from transcriptions and the JSON file
for video_id, info in video_info.items():
    if video_id in transcriptions:
        general = info.get('general', '')  # Get the general information of the video or an empty string if not present
        text = general.get('text', '')  # Get the text of the video or an empty string if not present
        hashtags = general.get('hashtags', [])  # Get the hashtags of the video or an empty list if not present
        video_data[video_id] = {'transcription': transcriptions[video_id], 'text': text, 'hashtags': hashtags}

# Example of accessing the combined information for the first video
first_video = list(video_data.keys())[0]
print("Transcription:", video_data[first_video]['transcription'])
print("Text:", video_data[first_video]['text'])
print("Hashtags:", video_data[first_video]['hashtags'])

Transcription: 
Text: Confidence went 📈
Hashtags: []


In [6]:
# Counting the elements that are not empty strings in transcriptions
transcription_count = sum(1 for trans in transcriptions.values() if trans != '')
print("Number of non-empty transcriptions:", transcription_count)

# Counting the number of elements in video_data where the transcription is not ''
video_data_count_non_empty = sum(1 for video_id, data in video_data.items() if data['transcription'] != '')
print("Number of video_data elements with non-empty transcriptions:", video_data_count_non_empty)

# Counting the elements that are not empty strings in the video text
texts_count = sum(1 for video_id, data in video_data.items() if data['text'] != '')
print("Number of video_data elements with non-empty text:", texts_count)

# Counting the elements that are not empty lists in hashtags
hashtags_count = sum(1 for video_id, data in video_data.items() if data['hashtags'] != [])
print("Number of video_data elements with non-empty hashtags:", hashtags_count)


Number of non-empty transcriptions: 261
Number of video_data elements with non-empty transcriptions: 261
Number of video_data elements with non-empty text: 962
Number of video_data elements with non-empty hashtags: 853


As we can see, we have a transcription for 1 out of 4 videos aprox. Nevertheless, we have text and hashtags for almost all of the videos, so a text strudy could be deployed.

## Text Pre-Processing

### Text cleaning

In [7]:
# text cleaning of transcriptions and text
for video_id, data in video_data.items():
    if 'transcription' in data:
        video_data[video_id]['clean_transcription'] = clean_text(data['transcription'])
    if 'text' in data:
        video_data[video_id]['clean_text'] = clean_text(data['text'])

In [8]:
video_data[list(video_data.keys())[34]]

{'transcription': "is there the top 10 strongest one piece characters by the end of the show at number 10 we have Trafalgar law he can make anyone Immortal at the expense of his life number 9 we have useless kid he challenged kaido even though the odds are against him just like Luffy number 8 we have Sabo even though he's not at his Peak he can challenge fujitora and number 7 we have GARP the only man known to go to Toe with the former pirate king Goldie Rodger number 6 we have a kind of the author himself Oda said that if a kind of was the main character he'd find the one piece in a single year number five we have Shanks at the marineford where he just came in and said war is over and it really ended what else do you need to know about Shanks make sure like for part two",
 'text': '#Top10 Strongest #onepiece Characters by the end of the show. #anime #strongestcharacters #animeboy #luffy #zoro #animeedit #animeedits #animestiktok',
 'hashtags': ['top10',
  'onepiece',
  'anime',
  'str

As we can see, it has occured a tokenization, lemmatization and an elimination of hashtags, punctuation signs, special characters and stopwords for both text and transcription, bearing in mind the language of the possible text and transcription.

### Exploratory analysis

The next step will be the frequency distribution and the word cloud of the clean transcripts and texts, as well as the hashtags (probably the most useful one).

In [9]:
video_data

{'6907228749016714497': {'transcription': '',
  'text': 'Confidence went 📈',
  'hashtags': [],
  'clean_transcription': '',
  'clean_text': 'confidence went'},
 '6875468410612993286': {'transcription': '',
  'text': 'Quiet Zone... follow me on insta: joeysofo. Comment where you wanna see me blade next. Reply to @dwight_schnuute',
  'hashtags': [],
  'clean_transcription': '',
  'clean_text': 'quiet zone follow insta joeysofo comment wan na see blade next reply'},
 '6898699405898059010': {'transcription': '',
  'text': 'Iphone bend test🤗 #tiktok #viral #fyp #iphone #test #bend',
  'hashtags': ['tiktok', 'viral', 'fyp', 'iphone', 'test', 'bend'],
  'clean_transcription': '',
  'clean_text': 'iphone bend'},
 '6902819837345533186': {'transcription': '',
  'text': '',
  'hashtags': [],
  'clean_transcription': '',
  'clean_text': ''},
 '6905635666588192002': {'transcription': '',
  'text': '小技です👟✨#tiktok教室#tutorial',
  'hashtags': ['tiktok教室', 'tutorial'],
  'clean_transcription': '',
  'cl

In [10]:
# Collect clean transcripts, clean texts, and hashtags from video_info
clean_transcripts = []
clean_texts = []
hashtags = []
for info in video_data.values():
    if 'clean_transcription' in info:
        clean_transcripts.extend(word_tokenize(info['clean_transcription']))
    if 'clean_text' in info:
        clean_texts.extend(word_tokenize(info['clean_text']))
    if 'hashtags' in info:
        hashtags.extend(info['hashtags'])

In [11]:
# Plot frequency distribution and word cloud for clean transcripts
print("Analysis of Transcripts:")
plot_freq_dist(clean_transcripts)
analyze_text(' '.join(clean_transcripts))

Analysis of Transcripts:
<FreqDist with 2008 samples and 5244 outcomes>
Most common words: [('like', 84), ('know', 67), ('get', 46), ('going', 46), ('yeah', 46), ('want', 41), ('people', 40), ('one', 40), ('really', 35), ('way', 32)]


In [12]:
# Plot frequency distribution and word cloud for clean texts
print("Analysis of Captions:")
plot_freq_dist(clean_texts)
analyze_text(' '.join(clean_texts))

Analysis of Captions:
<FreqDist with 1953 samples and 2840 outcomes>
Most common words: [('reply', 59), ('one', 17), ('video', 15), ('day', 13), ('1', 12), ('know', 12), ('love', 11), ('jij', 10), ('antwoorden', 10), ('get', 10)]


In [13]:
# Plot frequency distribution and word cloud for hashtags
print("Analysis of Hashtags:")
plot_freq_dist(hashtags)
analyze_text(' '.join(hashtags))

Analysis of Hashtags:
<FreqDist with 2220 samples and 5330 outcomes>
Most common words: [('fyp', 417), ('foryou', 272), ('foryoupage', 174), ('fy', 116), ('fitness', 87), ('workout', 73), ('voorjou', 72), ('viral', 65), ('animeedit', 63), ('anime', 50)]


As we can see, it does not seem that a clear keyword is used in the transcripts or captions. Nevertheless, most tiktok videos has the hashtag ``fyp``, ``foryou``, ``foryoupage``, maybe in order to be recommended to users with similar interests. Furthermore, the most used ones are related with the exercise and anime.

Other techniques such as tf idf, post tagging, NER, topic modelling, etc. do not seem to have a clear usefulness for this use case. Likewise, an example is made to indicate its operation.

In [14]:
# Example usage with clean transcripts and clean texts
sentences = clean_transcripts + clean_texts
get_bag_of_words(sentences)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [15]:
# Example usage
topics_per_video = perform_topic_modeling(video_data)
for video_id, topics in topics_per_video.items():
    print("Topics for video", video_id)
    print(topics)

Topics for video 6907228749016714497
[(0, '1.000*"confidence went"'), (1, '1.000*"confidence went"')]
Topics for video 6875468410612993286
[(0, '1.000*"quiet zone follow insta joeysofo comment wan na see blade next reply"'), (1, '1.000*"quiet zone follow insta joeysofo comment wan na see blade next reply"')]
Topics for video 6898699405898059010
[(0, '0.143*"iphone" + 0.143*"iphone bend"'), (1, '0.143*"fyp" + 0.143*"viral"')]
Topics for video 6902819837345533186
No cleaned data available
Topics for video 6905635666588192002
[(0, '0.500*"tiktok教室" + 0.500*"tutorial"'), (1, '0.500*"tutorial" + 0.500*"tiktok教室"')]
Topics for video 6895497835681287426
[(0, '0.250*"volleyballlove" + 0.250*"great rally show love comments"'), (1, '0.250*"volleyball" + 0.250*"volleyballworld"')]
Topics for video 6895303013867539713
[(0, '1.000*"oh"'), (1, '1.000*"oh"')]
Topics for video 6884590643327290625
[(0, '1.000*"timewarpscan"'), (1, '1.000*"timewarpscan"')]
Topics for video 6906514963569888513
[(0, '0.07

It is easily observable that no large conclusions can be drawn from these exploratory and feature extraction methods. This is due to the short and incomprehensible texts of the videos and transcripts without the rest of the information.

We already knew that text was our weakest source of information, and from which we could possibly extract the least information.

### Sentiment analysis

Now, the final step of data exploration and feature extraction focuses on a basic sentiment analysis. The goal is to extract more information from the videos in order to create more accurate models, or even to consider the possibility of using models that do not rely on text, but only on hashtags and sentiments to predict their virality. Both models will be tested to observe which ones obtain more accurate results for our training and validation sets, which will be divided in the next step.

In [16]:
classify_sentiment(video_data)

In [17]:
video_data

{'6907228749016714497': {'transcription': '',
  'text': 'Confidence went 📈',
  'hashtags': [],
  'clean_transcription': '',
  'clean_text': 'confidence went',
  'text_sentiment': 'Positive',
  'transcript_sentiment': 'Unknown'},
 '6875468410612993286': {'transcription': '',
  'text': 'Quiet Zone... follow me on insta: joeysofo. Comment where you wanna see me blade next. Reply to @dwight_schnuute',
  'hashtags': [],
  'clean_transcription': '',
  'clean_text': 'quiet zone follow insta joeysofo comment wan na see blade next reply',
  'text_sentiment': 'Negative',
  'transcript_sentiment': 'Unknown'},
 '6898699405898059010': {'transcription': '',
  'text': 'Iphone bend test🤗 #tiktok #viral #fyp #iphone #test #bend',
  'hashtags': ['tiktok', 'viral', 'fyp', 'iphone', 'test', 'bend'],
  'clean_transcription': '',
  'clean_text': 'iphone bend',
  'text_sentiment': 'Neutral',
  'transcript_sentiment': 'Unknown'},
 '6902819837345533186': {'transcription': '',
  'text': '',
  'hashtags': [],
  

In [18]:
# Count the sentiments of the videos
count_sentiments(video_data)

Sentiment Counts for Text and Transcriptions (BLOB):

Sentiment  Text       Transcription
Positive   139        109       
Negative   63         45        
Neutral    798        846       
Unknown    0          0         


Sentiment Counts for Text and Transcriptions (VADER):

Sentiment  Text       Transcription
Positive   152        130       
Negative   62         51        
Neutral    786        819       
Unknown    0          0         


Sentiment Counts for Text and Transcriptions (BOTH):

Sentiment  Text       Transcription
Positive   202        136       
Negative   87         60        
Neutral    673        65        
Unknown    38         739       




As can be seen, most videos are either unlabeled or neutral, especially when looking at the transcriptions, although there appear to be more videos classified as positive sentiment than negative. Additionally, the hybrid method seems to be more effective, using both TextBlob and VADER for sentiment analysis and producing a combined response.

Finally, it's normal for the sentiments of a video's transcription and caption to differ, as captions may consist solely of hashtags or may indicate a response, among other possibilities.

### Train/Test split

The next and last step before the model creation will be the train/test split and the addition of out virality response variable, created, explained and stored in the ``project.ipynb``.

In [23]:
import pandas as pd
from sklearn.model_selection import train_test_split

csv_path = os.path.join(root_dir, csv_file)

df_virality = pd.read_csv(csv_path)

data = []

for video_id, info in video_data.items():
    virality_info = df_virality[df_virality['id'] == int(video_id)]
    if not virality_info.empty:
        combined_info = {
            'video_id': video_id,
            'transcription': info['transcription'],
            'text': info['text'],
            'hashtags': info['hashtags'],
            'norm_virality': virality_info['norm_virality'].values[0]  # Add norm_virality from the DataFrame
        }
        if combined_info['transcription']:
            combined_info['clean_transcription'] = info['clean_transcription']
            combined_info['transcript_sentiment'] = info['transcript_sentiment']
        if combined_info['text']:
            combined_info['clean_text'] = info['clean_text']
            combined_info['text_sentiment'] = info['text_sentiment']
        data.append(combined_info) if video_id in df_virality['id'].values else print(f" )
        
df = pd.DataFrame(data)

In [24]:
df_virality

Unnamed: 0,id,likes,shares,comments,views,virality,norm_virality
0,6907228749016714497,3710,50,68,44800,4.489292e+04,0.000177
1,6875468410612993286,55700,1817,936,838100,8.397287e+05,0.003337
2,6898699405898059010,936200,21100,27100,15300000,1.532623e+07,0.060929
3,6902819837345533186,12900,197,143,94900,9.522022e+04,0.000377
4,6905635666588192002,8805,198,52,115300,1.155290e+05,0.000457
...,...,...,...,...,...,...,...
995,6877191692341054721,13300,152,111,129300,1.296154e+05,0.000513
996,6908069845825359109,12200,223,321,80700,8.102722e+04,0.000320
997,6883484287434378497,26600,3392,668,449300,4.506675e+05,0.001790
998,6898721943978036481,10000,111,274,72200,7.245293e+04,0.000286


In [27]:
data

[]

In [None]:

train_df, Test_df = train_test_split(df_virality, test_size=0.2, random_state=42)

In [None]:


# Convertir a DataFrame de pandas
train_data_df = pd.DataFrame(train_data)
test_data_df = pd.DataFrame(test_data)

# Paso 4: Guardar los datos combinados en un nuevo archivo CSV
train_data_df.to_csv('train_data.csv', index=False)
test_data_df.to_csv('test_data.csv', index=False)
