<a href="https://colab.research.google.com/github/mlsafonseca/AzureAI102Files_Python/blob/main/Farcaster.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **CONSENSYS MARKET RESEARCH ANALYST TECHNICAL EXERCISE**
**Author:** Maria de Lurdes Fonseca

**Date:** 07.08.2024

**Version:** V1.1

**Summary:**

*   Step 0: Setting up the environment
*   Step 1: KOL Scores with Details (Without Sentiment)
*   Step 2: Sentiment Analysis and KOL Score With Sentiment
*   Step 3: Casts from TOP20 KOLs - Last 30 Days
*   Step 4: Recasts from TOP20 KOLs - Last 30 Days
*   Step 5: Casts From TOP20 KOLs - Last 180 Days
*   Step 6: Topic Evaluation With GEN AI (Pretrained Transformers)

# **STEP 0: Setting up the environment**

The next block of code sets up a virtual environment and a requirements file to support external model deployment, if needed.

In [None]:
# Installing the 'virtualenv' package
!pip install virtualenv

# Creating a Python 3 virtual environment named 'myvenv'
!virtualenv -p python3 myvenv

# Activating the created virtual environment
# (each '!' command runs in its own subshell, so this activation does not carry over to the commands below)
!source myvenv/bin/activate

# Installing the required packages
!pip install dune-client pandas nltk scikit-learn gensim transformers

# Using 'pip freeze' to write the installed packages to 'requirements.txt' inside the 'myvenv' directory
!pip freeze > myvenv/requirements.txt

# Checking the directory
!ls myvenv/

# Checking the requirements file
!cat myvenv/requirements.txt

Collecting virtualenv
  Downloading virtualenv-20.26.3-py3-none-any.whl.metadata (4.5 kB)
Collecting distlib<1,>=0.3.7 (from virtualenv)
  Downloading distlib-0.3.8-py2.py3-none-any.whl.metadata (5.1 kB)
Downloading virtualenv-20.26.3-py3-none-any.whl (5.7 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m5.7/5.7 MB[0m [31m43.5 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading distlib-0.3.8-py2.py3-none-any.whl (468 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m468.9/468.9 kB[0m [31m26.8 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: distlib, virtualenv
Successfully installed distlib-0.3.8 virtualenv-20.26.3
created virtual environment CPython3.10.12.final.0-64 in 1393ms
  creator CPython3Posix(dest=/content/myvenv, clear=False, no_vcs_ignore=False, global=False)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/root/.local/share/virtualenv)
    added seed packages: pip==24
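
Since every `!` line runs in a separate subshell, the `pip install` and `pip freeze` commands above act on the notebook's own interpreter rather than on `myvenv`. A minimal sketch, using only the standard library, of how to check which environment the running kernel belongs to (the helper name `in_virtualenv` is illustrative, not part of any library):

```python
import sys

def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv/virtualenv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)

print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtualenv())
```

To install into `myvenv` itself, the environment's own executable can be called directly, for example `!myvenv/bin/pip install dune-client`.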

In [None]:
# Importing required libraries
from dune_client.client import DuneClient
import pandas as pd
import re
from google.colab import files
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import gensim
from gensim import corpora
from gensim.models import CoherenceModel
from transformers import T5ForConditionalGeneration, T5Tokenizer
from transformers import BartForConditionalGeneration, BartTokenizer
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

In [None]:
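# Initializing the Dune client with an API key read from the DUNE_API_KEY environment variable
# (avoids storing the secret in the notebook)
import os
api_key = os.environ["DUNE_API_KEY"]
dune = DuneClient(api_key)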

# **STEP 1: KOL Scores with Details (Without Sentiment)**

In [None]:
# Fetching the latest result of the query
query_id = 3962836
query_result = dune.get_latest_result(query_id)

# Accessing the rows attribute to get the data
result_data = None

# Checking if 'rows' contains data
if hasattr(query_result.result, 'rows'):
    result_data = query_result.result.rows
else:
    print("No 'rows' attribute found in the result.")

# Converting result_data to DataFrame
if result_data:
    # Checking if result_data is directly a list of dictionaries
    if isinstance(result_data, list) and len(result_data) > 0 and isinstance(result_data[0], dict):
        kol_scores = pd.DataFrame(result_data)
    elif isinstance(result_data, dict) and 'data' in result_data:
        kol_scores = pd.DataFrame(result_data['data'])
    else:
        print("Unsupported data format:", type(result_data))
else:
    print("No data available or unable to access data.")

# Defining the desired column order in a single line
desired_column_order = ['fid', 'kol_score', 'followers_from_top_300', 'kol_score_final', 'consensys_mentions', 'metamask_mentions', 'infura_mentions', 'linea_mentions', 'ethereum_mentions', 'mentions_to_products', 'avatar_url', 'display_name', 'profile_bio', 'user_url', 'preferred_fname', 'fid_created_at', 'weeks_since_fid_creation', 'number_of_followers', 'number_of_casts', 'number_of_recasts', 'number_of_replies', 'number_of_mentions', 'number_of_likes', 'days_since_pb', 'post_virality_score']

# Checking if all desired columns are present in the DataFrame
if 'kol_scores' in locals():
    missing_columns = set(desired_column_order) - set(kol_scores.columns)
    if missing_columns:
        print(f"Warning: The following columns are missing from the DataFrame: {missing_columns}")

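    # Reordering the DataFrame columns (keeping only those that are present, to avoid a KeyError)
    kol_scores = kol_scores[[col for col in desired_column_order if col in kol_scores.columns]]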

    # Sorting the DataFrame based on 'kol_score_final' in descending order
    kol_scores = kol_scores.sort_values(by='kol_score_final', ascending=False)

    # Displaying the first 10 rows of the sorted DataFrame
    display(kol_scores.head(10))

    # Save to CSV and download
    kol_scores.to_csv('kol_score_sorted.csv', index=False)
    files.download('kol_score_sorted.csv')
else:
    print("Failed to create DataFrame.")

Unnamed: 0,fid,kol_score,followers_from_top_300,kol_score_final,consensys_mentions,metamask_mentions,infura_mentions,linea_mentions,ethereum_mentions,mentions_to_products,...,fid_created_at,weeks_since_fid_creation,number_of_followers,number_of_casts,number_of_recasts,number_of_replies,number_of_mentions,number_of_likes,days_since_pb,post_virality_score
0,3,0.218382,290,0.129338,3,27,0,27,267,54,...,2023-09-05 23:07:28,47,490763,187679,130719,23655,3528,967478,61,1.975425
1,5650,0.088805,271,0.09888,0,2,0,0,82,2,...,2023-09-05 23:13:52,47,382639,44590,91038,1140,21,427499,61,2.565411
2,99,0.077765,280,0.097776,0,3,0,2,35,5,...,2023-09-05 23:07:42,47,453896,84649,84964,4914,1907,493595,61,2.246485
3,2,0.04497,273,0.094497,0,4,0,6,36,10,...,2023-09-05 23:07:28,47,417823,31540,33352,3164,1614,200520,61,1.299094
4,12,0.02139,257,0.092139,0,1,0,1,57,2,...,2023-09-05 23:07:33,47,399625,22326,17417,4286,2086,144918,61,0.617924
5,8,0.018963,257,0.091896,0,1,0,2,10,3,...,2023-09-05 23:07:32,47,443882,20859,28139,1122,345,164673,61,0.547812
6,239,0.018301,257,0.09183,1,2,0,4,7,6,...,2023-09-05 23:07:55,47,379604,41817,27862,4252,3316,219594,61,0.528694
7,576,0.014442,258,0.091444,0,2,2,278,12,282,...,2023-09-05 23:08:40,47,418916,42274,36028,6164,1709,285712,61,0.417206
8,129,0.003761,261,0.090376,0,5,0,3,18,8,...,2023-09-05 23:07:45,47,443888,32972,21763,2728,1214,185557,61,0.108652
9,207,0.002867,256,0.090287,2,0,0,6,84,6,...,2023-09-05 23:07:53,47,445120,3520,4132,362,98,27943,61,0.082821
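
The fetch-and-convert pattern in the cell above recurs for every query in this notebook. A minimal sketch of how it could be factored into a single helper, assuming (as the cell does) that the rows arrive either as a list of dictionaries or as a dictionary with a `data` key. The name `rows_to_frame` is hypothetical and not part of `dune_client`:

```python
import pandas as pd

def rows_to_frame(rows, column_order=None):
    """Convert query rows into a DataFrame, optionally reordering columns.

    Hypothetical helper: accepts a list of dicts or a dict with a 'data' key,
    the two shapes handled by the notebook cells.
    """
    if isinstance(rows, dict) and "data" in rows:
        rows = rows["data"]
    if not isinstance(rows, list) or not rows or not isinstance(rows[0], dict):
        raise ValueError(f"Unsupported or empty data format: {type(rows)}")

    frame = pd.DataFrame(rows)
    if column_order is not None:
        missing = [c for c in column_order if c not in frame.columns]
        if missing:
            print(f"Warning: missing columns skipped: {missing}")
        # Keep only the requested columns that are actually present
        frame = frame[[c for c in column_order if c in frame.columns]]
    return frame

# Toy rows standing in for query_result.result.rows
sample = [{"kol_score": 0.2, "fid": 3}, {"kol_score": 0.1, "fid": 99}]
print(rows_to_frame(sample, column_order=["fid", "kol_score", "display_name"]))
```

Raising an exception instead of printing a message means a failed conversion stops the cell, so later cells cannot silently reuse a stale DataFrame from an earlier run.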



In [None]:
# Producing a basic description of the dataset and respective variables
print("\n\033[1m\033[30mColumns in the DataFrame:\033[0m\n")
print(kol_scores.columns)
print("\n\033[1m\033[30mCases and columns in the DataFrame:\033[0m\n")
print(kol_scores.shape)
print("\n\033[1m\033[30mColumn datatypes list:\033[0m\n")
kol_scores.info()
print("\n\033[1m\033[30mMissing values:\033[0m\n")
print(kol_scores.isna().sum())

# Printing main descriptive statistics
print("\n\033[1m\033[30mVariable main descriptives:\033[0m\n")
kol_scores.describe()


[1m[30mColumns in the DataFrame:[0m

Index(['fid', 'kol_score', 'followers_from_top_300', 'kol_score_final',
       'consensys_mentions', 'metamask_mentions', 'infura_mentions',
       'linea_mentions', 'ethereum_mentions', 'mentions_to_products',
       'avatar_url', 'display_name', 'profile_bio', 'user_url',
       'preferred_fname', 'fid_created_at', 'weeks_since_fid_creation',
       'number_of_followers', 'number_of_casts', 'number_of_recasts',
       'number_of_replies', 'number_of_mentions', 'number_of_likes',
       'days_since_pb', 'post_virality_score'],
      dtype='object')

[1m[30mCases and columns in the DataFrame:[0m

(300, 25)

[1m[30mColumn datatypes list:[0m

<class 'pandas.core.frame.DataFrame'>
Index: 300 entries, 0 to 299
Data columns (total 25 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   fid                       300 non-null    int64  
 1   kol_score                 300 non-

Unnamed: 0,fid,kol_score,followers_from_top_300,kol_score_final,consensys_mentions,metamask_mentions,infura_mentions,linea_mentions,ethereum_mentions,mentions_to_products,weeks_since_fid_creation,number_of_followers,number_of_casts,number_of_recasts,number_of_replies,number_of_mentions,number_of_likes,days_since_pb,post_virality_score
count,300.0,300.0,300.0,300.0,300.0,300.0,300.0,300.0,300.0,300.0,300.0,300.0,300.0,300.0,300.0,300.0,300.0,300.0,300.0
mean,159485.39,0.008916,61.53,0.0215,0.123333,2.673333,0.073333,2.543333,9.16,5.29,36.203333,61076.513333,18955.866667,11712.393333,7451.923333,1871.99,58753.656667,59.553333,0.103837
std,190679.600093,0.031804,82.729942,0.028811,0.624035,7.768814,0.456783,16.831094,25.070491,19.05719,12.808154,115792.214973,41670.053563,28724.190467,17088.077603,7642.415586,117938.28124,6.485421,0.35946
min,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,380.0,199.0,85.0,5.0,1.0,731.0,4.0,0.0
25%,2724.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,25.0,1763.0,1252.0,459.5,545.5,72.5,4121.5,61.0,0.0
50%,14911.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,47.0,5182.5,4208.0,2017.0,1899.5,250.0,16307.0,61.0,0.0
75%,307228.0,0.002085,115.25,0.0401,0.0,2.0,0.0,2.0,7.0,4.0,47.0,57361.25,19433.25,8678.0,6164.0,841.75,58590.5,61.0,0.037605
max,784003.0,0.311145,290.0,0.129338,6.0,78.0,5.0,278.0,267.0,282.0,47.0,490763.0,402236.0,238989.0,188958.0,77667.0,967478.0,61.0,2.888818


# **STEP 2: Sentiment Analysis and KOL Score With Sentiment**

In [None]:
# Fetching the latest result of the query
query_id = 3962782 # "Top300_KOLs_Casts and Recasts with Keywords - Brand Affinity"
query_result = dune.get_latest_result(query_id)

# Accessing the rows attribute to get the data
result_data = None

# Checking if 'rows' contains data
if hasattr(query_result.result, 'rows'):
    result_data = query_result.result.rows
else:
    print("No 'rows' attribute found in the result.")

# Converting result_data to DataFrame
if result_data:
    # Checking if result_data is directly a list of dictionaries
    if isinstance(result_data, list) and len(result_data) > 0 and isinstance(result_data[0], dict):
        casts_recasts_KOLTOP300 = pd.DataFrame(result_data)
    elif isinstance(result_data, dict) and 'data' in result_data:
        casts_recasts_KOLTOP300 = pd.DataFrame(result_data['data'])
    else:
        print("Unsupported data format:", type(result_data))
else:
    print("No data available or unable to access data.")

# Defining the desired column order in a single line
desired_column_order1 = ['created_at', 'days_since_creation', 'fid', 'hash', 'parent_hash', 'parent_fid', 'parent_url', 'text', 'embeds']

# Checking if all desired columns are present in the DataFrame
if 'casts_recasts_KOLTOP300' in locals():
    missing_columns = set(desired_column_order1) - set(casts_recasts_KOLTOP300.columns)
    if missing_columns:
        print(f"Warning: The following columns are missing from the DataFrame: {missing_columns}")

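    # Reordering the DataFrame columns (keeping only those that are present, to avoid a KeyError)
    casts_recasts_KOLTOP300 = casts_recasts_KOLTOP300[[col for col in desired_column_order1 if col in casts_recasts_KOLTOP300.columns]]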

    # Displaying the first 3 rows of the DataFrame
    display(casts_recasts_KOLTOP300.head(3))
else:
    print("Failed to create DataFrame.")

Unnamed: 0,created_at,days_since_creation,fid,hash,parent_hash,parent_fid,parent_url,text,embeds
0,2024-08-02 13:49:34,3,13505,0x9cc8ced7215b345cc70fc6dba1c4c292e17c81f4,0x83a95b7ff4769e36960b55615f7733e975fc2d35,19328.0,,well.. could you execute a couple of ethereum ...,[]
1,2024-08-05 15:01:37,0,7479,0x636afb798b03f5cbde9b4d075cb68816bcf3a8f6,,,https://ethereum.org,"Tradfi: 9-5, weekdays, Americans only (unless ...","[{""castId"": {""fid"": 6394, ""hash"": {""data"": [22..."
2,2024-08-05 15:11:30,0,7479,0xe9ee6612570afdcbfc372ccf1451fb40da00db27,0xb0ef575496861c5374c77fd9f981a49ba97bb4f1,773618.0,,Yup it just takes time for hype to translate i...,"[{""url"": ""https://www.eigenlayer.xyz/ecosystem..."


In [None]:
# Producing a basic description of the dataset and respective variables
print("\n\033[1m\033[30mColumns in the DataFrame:\033[0m\n")
print(casts_recasts_KOLTOP300.columns)
print("\n\033[1m\033[30mCases and columns in the DataFrame:\033[0m\n")
print (casts_recasts_KOLTOP300.shape)
print("\n\033[1m\033[30mColumn datatypes list:\033[0m\n")
casts_recasts_KOLTOP300.info()
print("\n\033[1m\033[30mMissing values:\033[0m\n")
print(casts_recasts_KOLTOP300.isna().sum())

# Printing main descriptive statistics
print("\n\033[1m\033[30mVariable main descriptives:\033[0m\n")
casts_recasts_KOLTOP300.describe()


[1m[30mColumns in the DataFrame:[0m

Index(['created_at', 'days_since_creation', 'fid', 'hash', 'parent_hash',
       'parent_fid', 'parent_url', 'text', 'embeds'],
      dtype='object')

[1m[30mCases and columns in the DataFrame:[0m

(2962, 9)

[1m[30mColumn datatypes list:[0m

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2962 entries, 0 to 2961
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   created_at           2962 non-null   object 
 1   days_since_creation  2962 non-null   int64  
 2   fid                  2962 non-null   int64  
 3   hash                 2962 non-null   object 
 4   parent_hash          1645 non-null   object 
 5   parent_fid           1645 non-null   float64
 6   parent_url           1060 non-null   object 
 7   text                 2962 non-null   object 
 8   embeds               2962 non-null   object 
dtypes: float64(1), int64(2), object(6)
memory usage: 

Unnamed: 0,days_since_creation,fid,parent_fid
count,2962.0,2962.0,1645.0
mean,107.3842,79074.671506,112361.13617
std,61.952343,140856.87765,161099.870235
min,0.0,2.0,2.0
25%,56.0,576.0,2341.0
50%,112.0,6596.0,9134.0
75%,162.0,20388.0,236391.0
max,217.0,784003.0,816934.0


In [None]:
# Fetching the latest result of the query
query_id = 3971126 # "Recasts_KOLTOP40_20days"
query_result = dune.get_latest_result(query_id)

# Accessing the rows attribute to get the data
result_data = None

# Checking if 'rows' contains data
if hasattr(query_result.result, 'rows'):
    result_data = query_result.result.rows
else:
    print("No 'rows' attribute found in the result.")

# Converting result_data to DataFrame
if result_data:
    # Checking if result_data is directly a list of dictionaries
    if isinstance(result_data, list) and len(result_data) > 0 and isinstance(result_data[0], dict):
        recasts_KOLTOP40_20days = pd.DataFrame(result_data)
    elif isinstance(result_data, dict) and 'data' in result_data:
        recasts_KOLTOP40_20days = pd.DataFrame(result_data['data'])
    else:
        print("Unsupported data format:", type(result_data))
else:
    print("No data available or unable to access data.")

# Displaying the DataFrame
if 'recasts_KOLTOP40_20days' in locals():
    print(recasts_KOLTOP40_20days.head())
else:
    print("Failed to create DataFrame.")

            created_at                                             embeds  \
0  2024-08-08 11:00:46  [{"url": "https://stream.warpcast.com/v1/video...   
1  2024-08-08 10:12:15                                                 []   
2  2024-08-08 10:06:20                                                 []   
3  2024-08-08 10:37:36                                                 []   
4  2024-08-08 10:47:50                                                 []   

      fid                                        hash  parent_fid  \
0    4282  0x2701823480541ab0baed686e0bca8099364aa879      431629   
1  247143  0xf22bd60310894ca0f15f8ad875701cd7915e7bc1      261625   
2  247143  0x35c8ead5168772b62137b63583a94c52cc6c1229      403090   
3  268455  0xc5a8cc3727f3f36d8b27b37632b282685f408ec6      535238   
4     274  0x059e89fb5ea872504726cc39ce41726bef2f4580      435085   

                                  parent_hash parent_url  \
0  0x658a78110856a1ec7590e09724ec58505e8301a1       None   
1 

In [None]:
# Producing a basic description of the dataset and respective variables
print("\n\033[1m\033[30mColumns in the DataFrame:\033[0m\n")
print(recasts_KOLTOP40_20days.columns)
print("\n\033[1m\033[30mCases and columns in the DataFrame:\033[0m\n")
print (recasts_KOLTOP40_20days.shape)
print("\n\033[1m\033[30mColumn datatypes list:\033[0m\n")
recasts_KOLTOP40_20days.info()
print("\n\033[1m\033[30mMissing values:\033[0m\n")
print(recasts_KOLTOP40_20days.isna().sum())

# Printing main descriptive statistics
print("\n\033[1m\033[30mVariable main descriptives:\033[0m\n")
recasts_KOLTOP40_20days.describe()


[1m[30mColumns in the DataFrame:[0m

Index(['created_at', 'embeds', 'fid', 'hash', 'parent_fid', 'parent_hash',
       'parent_url', 'text'],
      dtype='object')

[1m[30mCases and columns in the DataFrame:[0m

(20427, 8)

[1m[30mColumn datatypes list:[0m

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20427 entries, 0 to 20426
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   created_at   20427 non-null  object
 1   embeds       20427 non-null  object
 2   fid          20427 non-null  int64 
 3   hash         20427 non-null  object
 4   parent_fid   20427 non-null  int64 
 5   parent_hash  20427 non-null  object
 6   parent_url   0 non-null      object
 7   text         20427 non-null  object
dtypes: int64(2), object(6)
memory usage: 1.2+ MB

[1m[30mMissing values:[0m

created_at         0
embeds             0
fid                0
hash               0
parent_fid         0
parent_hash        0
parent_

Unnamed: 0,fid,parent_fid
count,20427.0,20427.0
mean,202408.349293,283350.163656
std,128380.697023,193654.399029
min,108.0,2.0
25%,16098.0,16098.0
50%,247143.0,311933.0
75%,268455.0,422233.0
max,562300.0,817958.0


In [None]:
# Downloading VADER lexicon
nltk.download('vader_lexicon')

# Initializing VADER sentiment analyzer
sid = SentimentIntensityAnalyzer()

# Adding domain-specific terms to the VADER lexicon
new_words = {
    'farcaster': 2.0,
    'metamask': 1.5,
    'consensys': 1.5,
    'infura': 1.5,
    'linea': 1.5,
    'ethereum': 2.0
}
sid.lexicon.update(new_words)

# Defining the function to apply sentiment analysis
def analyze_sentiment(text, threshold=0.05):
    sentiment_dict = sid.polarity_scores(text)
    if sentiment_dict['compound'] >= threshold:
        return 'positive'
    elif sentiment_dict['compound'] <= -threshold:
        return 'negative'
    else:
        return 'neutral'

# Applying the sentiment analysis function to the 'text' column in casts_recasts_KOLTOP300
casts_recasts_KOLTOP300['sentiment'] = casts_recasts_KOLTOP300['text'].apply(analyze_sentiment)

# Filter casts and recasts mentioning specific keywords in casts_recasts_KOLTOP300
brand_keywords = ['consensys', 'metamask', 'meta mask', 'infura', 'linea']
keyword_filter = casts_recasts_KOLTOP300['text'].str.contains('|'.join(brand_keywords), case=False, na=False)

# Apply the filter and calculate positive sentiment counts
filtered_brand_data = casts_recasts_KOLTOP300[keyword_filter]
positive_sentiment_counts = filtered_brand_data[filtered_brand_data['sentiment'] == 'positive'].groupby('fid').size()
total_sentiment_counts = filtered_brand_data.groupby('fid').size()

# Calculate the percentage of positive sentiments
sentiment_percentage = (positive_sentiment_counts / total_sentiment_counts * 100).reset_index()
sentiment_percentage.columns = ['fid', 'sentiment_brand_products']

# Merge the sentiment percentages back to kol_scores
kol_scores = kol_scores.merge(sentiment_percentage, on='fid', how='left')

# Fill NaN values with 0 (for fids that have no corresponding texts with the specified keywords)
kol_scores['sentiment_brand_products'] = kol_scores['sentiment_brand_products'].fillna(0)

# Applying the sentiment analysis function to the 'text' column in recasts_KOLTOP40_20days
recasts_KOLTOP40_20days['sentiment'] = recasts_KOLTOP40_20days['text'].apply(analyze_sentiment)

# Filter rows in recasts_KOLTOP40_20days where parent_fid is in kol_scores['fid']
recasts_KOLTOP40_20days_filtered = recasts_KOLTOP40_20days[recasts_KOLTOP40_20days['parent_fid'].isin(kol_scores['fid'])]

# Counting the positive and the total recasts for each parent_fid
positive_recast_sentiment_counts = recasts_KOLTOP40_20days_filtered[recasts_KOLTOP40_20days_filtered['sentiment'] == 'positive'].groupby('parent_fid').size()
total_recast_sentiment_counts = recasts_KOLTOP40_20days_filtered.groupby('parent_fid').size()

# Calculating the percentage of positive sentiments for recasts
recast_sentiment_percentage = (positive_recast_sentiment_counts / total_recast_sentiment_counts * 100).reset_index()
recast_sentiment_percentage.columns = ['parent_fid', 'recasts_sentiment']

# Merging the recast sentiment percentages back to kol_scores
kol_scores = kol_scores.merge(recast_sentiment_percentage, left_on='fid', right_on='parent_fid', how='left')

# Filling NaN values with 0 (for fids that have no corresponding recasts)
kol_scores['recasts_sentiment'] = kol_scores['recasts_sentiment'].fillna(0)

# Dropping the 'parent_fid' column used for merging
kol_scores = kol_scores.drop(columns=['parent_fid'])

# Filtering rows in casts_recasts_KOLTOP300 containing 'ethereum' in text
casts_recasts_KOLTOP300_ethereum_filtered = casts_recasts_KOLTOP300[casts_recasts_KOLTOP300['text'].str.contains('ethereum', case=False, na=False)]

# Counting the positive and the total 'ethereum' casts for each fid
positive_ethereum_sentiment_counts = casts_recasts_KOLTOP300_ethereum_filtered[casts_recasts_KOLTOP300_ethereum_filtered['sentiment'] == 'positive'].groupby('fid').size()
total_ethereum_sentiment_counts = casts_recasts_KOLTOP300_ethereum_filtered.groupby('fid').size()

# Calculating the percentage of positive sentiments for ethereum
ethereum_sentiment_percentage = (positive_ethereum_sentiment_counts / total_ethereum_sentiment_counts * 100).reset_index()
ethereum_sentiment_percentage.columns = ['fid', 'sentiment_ethereum']

# Merging the ethereum sentiment percentages back to kol_scores
kol_scores = kol_scores.merge(ethereum_sentiment_percentage, on='fid', how='left')

# Filling NaN values with 0 (for fids that have no corresponding texts with the specified keywords)
kol_scores['sentiment_ethereum'] = kol_scores['sentiment_ethereum'].fillna(0)

# Normalizing the sentiment columns
def normalize(column):
    return (column - column.min()) / (column.max() - column.min())

kol_scores['norm_sentiment_brand_products'] = normalize(kol_scores['sentiment_brand_products'])
kol_scores['norm_recasts_sentiment'] = normalize(kol_scores['recasts_sentiment'])
kol_scores['norm_sentiment_ethereum'] = normalize(kol_scores['sentiment_ethereum'])

# Normalizing the other columns after converting them to per-week rates (divided by weeks since FID creation)
normalize_columns = [
    'number_of_followers',
    'number_of_casts',
    'number_of_recasts',
    'number_of_replies',
    'number_of_mentions',
    'number_of_likes',
    'days_since_pb',
    'post_virality_score',
    'followers_from_top_300'
]

for column in normalize_columns:
    norm_column = 'norm_' + column
    kol_scores[norm_column] = normalize(kol_scores[column] / kol_scores['weeks_since_fid_creation'])

# Considered weights
weights = {
    'norm_number_of_followers': 0.12,
    'norm_number_of_casts': 0.08,
    'norm_number_of_recasts': 0.12,
    'norm_number_of_replies': 0.12,
    'norm_number_of_mentions': 0.08,
    'norm_number_of_likes': 0.08,
    'norm_days_since_pb': 0.04,
    'norm_post_virality_score': 0.08,
    'norm_followers_from_top_300': 0.08,
    'norm_sentiment_brand_products': 0.10,
    'norm_recasts_sentiment': 0.06,
    'norm_sentiment_ethereum': 0.04
}

# Calculating the new kol_score_final_with_sentiment as the weighted sum of the normalized columns
kol_scores['kol_score_final_with_sentiment'] = sum(
    kol_scores[column] * weight for column, weight in weights.items()
)

# Dropping normalized columns
kol_scores = kol_scores.drop(columns=[col for col in kol_scores.columns if col.startswith('norm_')])

# Sorting the DataFrame by the new score in descending order
kol_scores = kol_scores.sort_values(by='kol_score_final_with_sentiment', ascending=False)

# Displaying the resulting DataFrame with the new column
display(kol_scores.head(10))

# Save to CSV and download
kol_scores.to_csv('kol_score_with_sentiment.csv', index=False)
files.download('kol_score_with_sentiment.csv')

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


Unnamed: 0,fid,kol_score,followers_from_top_300,kol_score_final,consensys_mentions,metamask_mentions,infura_mentions,linea_mentions,ethereum_mentions,mentions_to_products,...,number_of_recasts,number_of_replies,number_of_mentions,number_of_likes,days_since_pb,post_virality_score,sentiment_brand_products,recasts_sentiment,sentiment_ethereum,kol_score_final_with_sentiment
37,247143,0.26067,102,0.063567,0,2,0,0,14,2,...,171769,188958,47549,645496,61,0.170509,100.0,37.5,100.0,0.579761
19,281836,0.311145,133,0.078615,0,0,0,0,8,0,...,226570,77717,71435,791269,61,0.977142,0.0,29.032258,50.0,0.515807
0,3,0.218382,290,0.129338,3,27,0,27,267,54,...,130719,23655,3528,967478,61,1.975425,73.684211,47.5,88.764045,0.457473
45,269694,0.000693,166,0.060069,2,24,2,0,56,26,...,52273,85298,1075,601284,61,0.01108,78.571429,44.117647,89.285714,0.411673
2,99,0.077765,280,0.097776,0,3,0,2,35,5,...,84964,4914,1907,493595,61,2.246485,80.0,39.393939,82.857143,0.396381
51,234616,0.035013,146,0.053501,0,66,0,0,6,66,...,78088,86236,17641,273307,61,0.58106,57.575758,51.538462,100.0,0.349322
3,2,0.04497,273,0.094497,0,4,0,6,36,10,...,33352,3164,1614,200520,61,1.299094,100.0,51.428571,77.777778,0.348078
103,309710,0.00802,68,0.020802,0,3,0,0,7,3,...,60034,56128,44628,161741,61,0.12324,100.0,100.0,57.142857,0.339535
81,758919,0.090321,84,0.036532,0,5,0,1,2,6,...,2285,398,12,18533,27,0.099132,100.0,47.368421,100.0,0.337246
12,347,0.011363,253,0.081136,0,8,4,2,16,14,...,11221,3510,831,99343,61,0.328245,71.428571,100.0,75.0,0.318269
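
In the positive-share columns above, a value of 0 can mean either that an account has no matching casts at all or that it has matching casts but none classified as positive, because the left merge followed by `fillna(0)` treats both cases alike. A minimal sketch on toy data (the frames and values below are invented for illustration) that computes the same percentage in one pass and keeps the two cases apart:

```python
import pandas as pd

# Toy stand-in for the keyword-filtered casts (values are illustrative only)
casts = pd.DataFrame({
    "fid": [1, 1, 1, 2, 2],
    "sentiment": ["positive", "neutral", "positive", "negative", "neutral"],
})
kols = pd.DataFrame({"fid": [1, 2, 3]})

# Share of positive casts per fid: the mean of a boolean column, times 100
share = (
    casts["sentiment"].eq("positive")
    .groupby(casts["fid"])
    .mean()
    .mul(100)
    .rename("positive_share")
    .reset_index()
)

merged = kols.merge(share, on="fid", how="left")
# fid 2 has matching casts but none positive (0.0); fid 3 has no matching casts (NaN)
merged["has_matching_casts"] = merged["positive_share"].notna()
print(merged)
```

Keeping a flag such as `has_matching_casts` (a hypothetical column name) before filling the gaps makes it possible to tell "never mentioned the brand" from "mentioned it without positive sentiment" when interpreting the final score.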


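
The brand filter in Step 2 matches substrings, so a keyword such as `linea` also matches words like `linear`. A minimal sketch of a stricter variant that uses word boundaries. This is an alternative to the filter used in the notebook, not a description of it, and applying it would change the counts reported above:

```python
import re

brand_keywords = ["consensys", "metamask", "meta mask", "infura", "linea"]

# \b anchors each keyword at word boundaries; re.escape guards special characters
brand_pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in brand_keywords) + r")\b",
    flags=re.IGNORECASE,
)

def mentions_brand(text: str) -> bool:
    """Return True if the text contains any brand keyword as a whole word."""
    return bool(brand_pattern.search(text))

examples = [
    "Linea mainnet is live",
    "linear growth in users",
    "Just opened my MetaMask wallet",
]
for text in examples:
    print(mentions_brand(text), "-", text)
```

The same pattern string (`brand_pattern.pattern`) can be passed to `Series.str.contains` with `case=False` to filter a DataFrame column.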

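
The composite score in Step 2 is a weighted sum of min-max normalized components, where each activity count is first divided by the account's age in weeks. A minimal sketch of that arithmetic in plain Python on invented numbers, with only two components and illustrative weights (the notebook uses twelve weights that sum to 1). The zero-range guard in `min_max` is an addition: the notebook's pandas `normalize` would yield NaN for a constant column.

```python
def min_max(values):
    """Scale values to [0, 1]; return all zeros when every value is equal."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Guard against division by zero for constant columns
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Invented toy data for three accounts
followers = [1000, 4000, 2500]
likes = [300, 900, 300]
weeks_active = [10, 40, 5]

# Per-week rates, then min-max scaling of each component
norm_followers = min_max([f / w for f, w in zip(followers, weeks_active)])
norm_likes = min_max([l / w for l, w in zip(likes, weeks_active)])

# Illustrative weights that sum to 1
weights = {"followers": 0.6, "likes": 0.4}

scores = [
    weights["followers"] * nf + weights["likes"] * nl
    for nf, nl in zip(norm_followers, norm_likes)
]
print(scores)
```

Dividing by account age favours young accounts with rapid growth: in the toy data, the third account has fewer followers than the second but the highest per-week rate, so it receives the top score.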

# **STEP 3: Casts from TOP20 KOLs - Last 30 Days**

In [None]:
# Fetching the latest result of the query
query_id = 3962794  # casts_KOLs_TOP20_30days
query_result = dune.get_latest_result(query_id)

# Accessing the rows attribute to get the data
result_data = None

# Checking if 'rows' contains data
if hasattr(query_result.result, 'rows'):
    result_data = query_result.result.rows
else:
    print("No 'rows' attribute found in the result.")

# Converting result_data to DataFrame
if result_data:
    # Checking if result_data is directly a list of dictionaries
    if isinstance(result_data, list) and len(result_data) > 0 and isinstance(result_data[0], dict):
        casts_KOLTOP20_30days = pd.DataFrame(result_data)
    elif isinstance(result_data, dict) and 'data' in result_data:
        casts_KOLTOP20_30days = pd.DataFrame(result_data['data'])
    else:
        print("Unsupported data format:", type(result_data))
else:
    print("No data available or unable to access data.")

# Displaying the DataFrame
if 'casts_KOLTOP20_30days' in locals():
    print(casts_KOLTOP20_30days.head())
else:
    print("Failed to create DataFrame.")

            created_at  days_since_creation  \
0  2024-08-02 04:34:49                    5   
1  2024-08-02 05:31:21                    5   
2  2024-08-08 08:34:49                    0   
3  2024-08-08 08:30:42                    0   
4  2024-08-08 09:05:33                    0   

                                              embeds     fid  \
0  [{"url": "https://imagedelivery.net/BXluQx4ige...  758919   
1  [{"url": "https://liquidhammer.vercel.app/acti...  247143   
2       [{"url": "https://yo-dudes.vercel.app/api"}]  281836   
3  [{"url": "https://imagedelivery.net/BXluQx4ige...  562300   
4  [{"url": "http://far.cards"}, {"url": "https:/...  562300   

                                         hash  month parent_fid parent_hash  \
0  0x5237354d91039b7eac87f83385f281406a1f5ed6      8       None        None   
1  0x77b00147e5dd45a3ca77d74eec6d914dc8a51e95      8       None        None   
2  0x83a87803ce99a72f135560befdca10db81534aa8      8       None        None   
3  0x4bdccec4e68

In [None]:
# Producing a basic description of the dataset and respective variables
print("\n\033[1m\033[30mColumns in the DataFrame:\033[0m\n")
print(casts_KOLTOP20_30days.columns)
print("\n\033[1m\033[30mCases and columns in the DataFrame:\033[0m\n")
print (casts_KOLTOP20_30days.shape)
print("\n\033[1m\033[30mColumn datatypes list:\033[0m\n")
casts_KOLTOP20_30days.info()
print("\n\033[1m\033[30mMissing values:\033[0m\n")
print(casts_KOLTOP20_30days.isna().sum())

# Printing main descriptive statistics
print("\n\033[1m\033[30mVariable main descriptives:\033[0m\n")
casts_KOLTOP20_30days.describe()


[1m[30mColumns in the DataFrame:[0m

Index(['created_at', 'days_since_creation', 'embeds', 'fid', 'hash', 'month',
       'parent_fid', 'parent_hash', 'parent_url', 'text', 'year'],
      dtype='object')

[1m[30mCases and columns in the DataFrame:[0m

(3944, 11)

[1m[30mColumn datatypes list:[0m

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3944 entries, 0 to 3943
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   created_at           3944 non-null   object
 1   days_since_creation  3944 non-null   int64 
 2   embeds               3944 non-null   object
 3   fid                  3944 non-null   int64 
 4   hash                 3944 non-null   object
 5   month                3944 non-null   int64 
 6   parent_fid           0 non-null      object
 7   parent_hash          0 non-null      object
 8   parent_url           1848 non-null   object
 9   text                 3944 non-null   obje

Unnamed: 0,days_since_creation,fid,month,year
count,3944.0,3944.0,3944.0,3944.0
mean,15.535751,415891.629817,7.194726,2024.0
std,8.616214,304945.23296,0.39604,0.0
min,0.0,3.0,7.0,2024.0
25%,8.0,602.0,7.0,2024.0
50%,16.0,601131.0,7.0,2024.0
75%,23.0,657052.0,7.0,2024.0
max,29.0,784003.0,8.0,2024.0


In [None]:
# Downloading necessary NLTK data
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('vader_lexicon')

# Defining a list of custom stop words
custom_stop_words = {
    'let', 'know', 'check', 'find', 'another', 'nice', 'http', 'complete', 'challenged', 'created', 'matched',
    'today', 'like', 'follow', 'recast', 'get', 'good', 'join', 'lets', 'new', 'left', 'back', 'great', 'know',
    'nice', 'done', 'need', 'think', 'guys', 'people', 'thank', 'hello', 'bro', 'lol', 'much', 'really', 'right',
    'thats', 'hot', 'cool', 'use', 'yes', 'no', 'also', 'everyone', 'something', 'post', 'whats', 'come', 'many',
    'sure', 'ill', 'take', 'keep', 'every', 'well', 'ive', 'even', 'anyone', 'always', 'big', 'made', 'could',
    'using', 'feel', 'never', 'find', 'ready', 'thing', 'followers', 'better', 'let', 'already', 'looking','http',
    'look', 'say', 'try', 'yet', 'miss', 'give', 'coming', 'getting', 'last', 'yeah', 'fam', 'happy', 'hope',
    'days', 'start', 'things', 'gonna', 'another', 'please', 'lot', 'wait', 'amazing', 'users', 'user', 'around',
    'week', 'worth', 'share', 'joined', 'man', 'following', 'trying', 'guy', 'might', 'though', 'since', 'waiting',
    'actually', 'less', 'anything', 'seems', 'haha', 'makes', 'may', 'stuff', 'ago', 'finally', 'definitely',
    'everything', 'end', 'shit', 'fuck', 'damn', 'lmao', 'bad', 'one', 'first', 'comment', 'time', 'see', 'got',
    'make', 'still', 'want', 'day', 'way', 'cant', 'going', 'would', 'next', 'real', 'cast', 'reply', 'click',
    'help', 'hey', 'nothing', 'free', 'love', 'similar', 'seem', 'worked', 'basically', 'directly', 'especially',
    'likely', 'probably', 'exactly', 'totally', 'fully', 'currently', 'usually', 'absolutely', 'truly', 'recently',
    'simply', 'literally', 'via', 'sir', 'idk', 'either', 'omg', 'thanks', 'dude', 'based', 'best',
    'hour', 'social'
}

# Building the stop-word set and the lemmatizer once, instead of on every call
stop_words = set(stopwords.words('english'))
stop_words.update(custom_stop_words)  # Adding the custom stop words
wordnet_lemmatizer = WordNetLemmatizer()

# Preprocessing the text
def preprocess_text(text):
    # Removing URLs
    text = re.sub(r'http\S+', '', text)

    # Tokenizing, lowercasing, and keeping alphanumeric tokens that are not stop words
    words = word_tokenize(text.lower())
    words = [word for word in words if word.isalnum() and word not in stop_words]

    # Lemmatizing words
    words = [wordnet_lemmatizer.lemmatize(word) for word in words]

    return ' '.join(words)

# Assuming casts_KOLTOP20_30days DataFrame is already loaded
casts_KOLTOP20_30days['clean_text'] = casts_KOLTOP20_30days['text'].apply(preprocess_text)

# Vectorizing with TF-IDF; every term with a non-zero weight in a cast is kept as one of its keywords
tfidf_vectorizer = TfidfVectorizer()  # No max_features cap, so the full vocabulary is used
tfidf_matrix = tfidf_vectorizer.fit_transform(casts_KOLTOP20_30days['clean_text'])
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()

keywords = []
for row in tfidf_matrix:
    keywords.append([tfidf_feature_names[i] for i in row.nonzero()[1]])

casts_KOLTOP20_30days['keywords'] = keywords

# Converting documents to a corpus of word lists
texts = [text.split() for text in casts_KOLTOP20_30days['clean_text']]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Determining the optimal number of topics using coherence scores
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=1):
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = gensim.models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, passes=10, random_state=42)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())
    return model_list, coherence_values

limit = 30  # maximum number of topics
start = 10   # minimum number of topics
step = 1    # step size

model_list, coherence_values = compute_coherence_values(dictionary, corpus, texts, limit, start, step)

# Selecting the model with the highest coherence score
optimal_model = model_list[coherence_values.index(max(coherence_values))]

# Displaying coherence scores
for m, cv in zip(range(start, limit, step), coherence_values):
    print(f"Num Topics = {m}, Coherence Value = {cv}")

# Saving optimal number of topics (the index is scaled by step so this also holds when step != 1)
optimal_num_topics = start + step * coherence_values.index(max(coherence_values))
print(f"Optimal number of topics: {optimal_num_topics}")

# Reusing the model already trained for the optimal number of topics
# (same corpus, parameters, and random_state, so retraining would only repeat the work)
lda_model = optimal_model
topics = lda_model.print_topics(num_words=5)

# Assigning topics to documents
def get_document_topics(text):
    bow = dictionary.doc2bow(text.split())
    document_topics = lda_model.get_document_topics(bow)
    return document_topics

casts_KOLTOP20_30days['topics'] = casts_KOLTOP20_30days['clean_text'].apply(get_document_topics)

# Extracting the dominant topic for each document
def get_dominant_topic(topics):
    dominant_topic = max(topics, key=lambda x: x[1])[0]
    return dominant_topic

casts_KOLTOP20_30days['dominant_topic'] = casts_KOLTOP20_30days['topics'].apply(get_dominant_topic)

# Calculating the frequency of each topic
topic_freq = casts_KOLTOP20_30days['dominant_topic'].value_counts().reset_index()
topic_freq.columns = ['Topic', 'Frequency']

# Performing Sentiment Analysis
sid = SentimentIntensityAnalyzer()
casts_KOLTOP20_30days['sentiment'] = casts_KOLTOP20_30days['text'].apply(lambda x: sid.polarity_scores(x)['compound'])

# Calculating average sentiment for each topic
topic_sentiment = casts_KOLTOP20_30days.groupby('dominant_topic')['sentiment'].mean().reset_index()
topic_sentiment.columns = ['Topic', 'Average_Sentiment']

# Merging with topic frequency
topic_freq = pd.merge(topic_freq, topic_sentiment, on='Topic')

# Displaying the resulting DataFrame
display(casts_KOLTOP20_30days.head(10))

# Performing Keyword Analysis
# Flattening the list of keywords and counting the frequencies
all_keywords = [keyword for sublist in casts_KOLTOP20_30days['keywords'] for keyword in sublist]
keyword_counts = pd.Series(all_keywords).value_counts().reset_index()
keyword_counts.columns = ['Keyword', 'Frequency']
keyword_counts = keyword_counts.sort_values(by='Frequency', ascending=False)

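# Calculating average sentiment for each keyword
# Note: this is a substring match on clean_text, so 'fan' also matches 'fantasy';
# regex=False treats the keyword as literal text rather than as a regular expression
keyword_sentiment_list = []
for keyword in keyword_counts['Keyword']:
    avg_sentiment = casts_KOLTOP20_30days[casts_KOLTOP20_30days['clean_text'].str.contains(keyword, regex=False)]['sentiment'].mean()
    keyword_sentiment_list.append(avg_sentiment)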

keyword_counts['Average_Sentiment'] = keyword_sentiment_list

# Saving keyword analysis and topic modeling results to CSV
keyword_counts.to_csv('casts_keyword_analysis.csv', index=False)
topic_freq.to_csv('casts_topic_modeling_results.csv', index=False)

# Displaying the topic frequencies
display(topic_freq)

# Displaying top keywords
display(keyword_counts.head(50))

# Downloading the CSV files
from google.colab import files
files.download('casts_keyword_analysis.csv')
files.download('casts_topic_modeling_results.csv')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


Num Topics = 10, Coherence Value = 0.44567380768296116
Num Topics = 11, Coherence Value = 0.4694022874126866
Num Topics = 12, Coherence Value = 0.4419009993372422
Num Topics = 13, Coherence Value = 0.45338842799189716
Num Topics = 14, Coherence Value = 0.441660929592265
Num Topics = 15, Coherence Value = 0.445853712528087
Num Topics = 16, Coherence Value = 0.42207906529264594
Num Topics = 17, Coherence Value = 0.4357068289091968
Num Topics = 18, Coherence Value = 0.41829051046389165
Num Topics = 19, Coherence Value = 0.44895432437075833
Num Topics = 20, Coherence Value = 0.4404452027942106
Num Topics = 21, Coherence Value = 0.43903536591553294
Num Topics = 22, Coherence Value = 0.45260160423009455
Num Topics = 23, Coherence Value = 0.42426542651341764
Num Topics = 24, Coherence Value = 0.434641423659431
Num Topics = 25, Coherence Value = 0.4454871824894054
Num Topics = 26, Coherence Value = 0.4492196478509942
Num Topics = 27, Coherence Value = 0.43221476654660135
Num Topics = 28, Coher

Unnamed: 0,created_at,days_since_creation,embeds,fid,hash,month,parent_fid,parent_hash,parent_url,text,year,clean_text,keywords,topics,dominant_topic,sentiment
0,2024-08-02 04:34:49,5,"[{""url"": ""https://imagedelivery.net/BXluQx4ige...",758919,0x5237354d91039b7eac87f83385f281406a1f5ed6,8,,,https://warpcast.com/~/channel/dropglobalnews,Social media user reaction to this story in Pa...,2024,medium reaction story pakistan extremely onlin...,"[true, pakistani, online, extremely, pakistan,...","[(0, 0.010103573), (1, 0.010104051), (2, 0.010...",8,0.7906
1,2024-08-02 05:31:21,5,"[{""url"": ""https://liquidhammer.vercel.app/acti...",247143,0x77b00147e5dd45a3ca77d74eec6d914dc8a51e95,8,,,https://warpcast.com/~/channel/lp,yall know what time it is\n\ngather around lem...,2024,yall gather lem tell yall story,"[tell, lem, gather, yall, story]","[(0, 0.012991144), (1, 0.012990771), (2, 0.012...",8,0.0
2,2024-08-08 08:34:49,0,"[{""url"": ""https://yo-dudes.vercel.app/api""}]",281836,0x83a87803ce99a72f135560befdca10db81534aa8,8,,,,https://yo-dudes.vercel.app/api,2024,,[],"[(0, 0.09090909), (1, 0.09090909), (2, 0.09090...",0,0.0
3,2024-08-08 08:30:42,0,"[{""url"": ""https://imagedelivery.net/BXluQx4ige...",562300,0x4bdccec4e681bb0416803e8d01fb25150450ce03,8,,,https://warpcast.com/~/channel/hunt,Have you checked the $hunt-tip ranks today?\n\...,2024,checked rank top favorite builder list clappin...,"[tipping, clapping, list, builder, favorite, t...","[(1, 0.48585856), (7, 0.4595752)]",1,0.6239
4,2024-08-08 09:05:33,0,"[{""url"": ""http://far.cards""}, {""url"": ""https:/...",562300,0xe02c53f5523b492b08f9f89a89ad5df9dc9981cf,8,,,https://warpcast.com/~/channel/mintclub,"#18,000 NFT that created in Mint Club 🥳 🎉 \n\n...",2024,nft mint club,"[club, mint, nft]","[(0, 0.772716), (1, 0.022727625), (2, 0.022727...",0,0.25
5,2024-08-08 09:26:05,0,[],247143,0x624910dd4c811f5531ec8d4dbc4009accecb3c27,8,,,chain://eip155:7777777/erc721:0x5d6a07d07354f8...,"According to CoinGecko, the circulating supply...",2024,according coingecko circulating supply degen b...,"[possible, growth, plenty, million, 40, cap, m...","[(9, 0.94948804)]",9,0.3818
6,2024-08-08 08:43:59,0,"[{""url"": ""https://imagedelivery.net/BXluQx4ige...",562300,0x2f544f324a06cf8e15381883e4e2b64d4abc6ab8,8,,,https://warpcast.com/~/channel/mintclub,"We're hammering, coding, and putting the final...",2024,hammering coding putting final touch soon able...,"[chain, across, asset, purchase, able, soon, t...","[(0, 0.2516104), (5, 0.53525954), (10, 0.16116...",5,0.2481
7,2024-08-02 05:03:00,5,"[{""url"": ""https://thecard.fun/war-tournament/c...",657052,0x5dcec79a6160b1b8087b21da53adb81a1441f107,8,,,,has challenged to a battle!\nComplete the ba...,2024,battle battle,[battle],"[(0, 0.030303065), (1, 0.030303065), (2, 0.030...",8,-0.7339
8,2024-08-02 05:17:30,5,"[{""url"": ""https://moxie-frames.airstack.xyz/st...",281836,0xbb653fa9702168a763270901234e1ef5aff5631a,8,,,https://warpcast.com/~/channel/airstack,dudes fan tokens going brrrrrr,2024,dude fan token brrrrrr,"[brrrrrr, token, fan, dude]","[(0, 0.018187733), (1, 0.018187737), (2, 0.018...",9,0.3182
9,2024-08-02 03:05:55,5,"[{""url"": ""https://moxie-frames.airstack.xyz/sa...",602,0xbc23de4f9d7909c6ab23c17b72f268aa777d9785,8,,,,I just bid for 's Fan Tokens powered by cc,2024,bid fan token powered cc,"[cc, powered, bid, token, fan]","[(0, 0.01515161), (1, 0.01515161), (2, 0.01515...",3,0.3182


Unnamed: 0,Topic,Frequency,Average_Sentiment
0,2,1239,0.767829
1,0,434,0.154679
2,4,371,0.384725
3,3,320,0.380366
4,10,304,0.275539
5,8,294,-0.212017
6,6,267,0.239964
7,9,207,0.139723
8,7,193,0.214502
9,1,187,0.206219


Unnamed: 0,Keyword,Frequency,Average_Sentiment
0,channel,1316,0.758766
1,game,1274,0.765821
2,play,1240,0.769537
3,player,1232,0.777392
4,token,250,0.563489
5,moxie,241,0.527506
6,fan,227,0.582245
7,farcaster,209,0.466484
8,battle,136,-0.699716
9,frame,127,0.422143
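
The selection step in the cell above maps the position of the best coherence score back to a topic count. A minimal sketch of that mapping for any step size follows. `pick_num_topics` is a hypothetical helper (not part of gensim), and the scores are rounded illustrative values rather than the full run.

```python
def pick_num_topics(coherence_values, start, step=1):
    """Return (num_topics, score) for the highest coherence value.

    coherence_values[i] is assumed to belong to the model trained with
    start + i * step topics, matching range(start, limit, step).
    """
    if not coherence_values:
        raise ValueError("coherence_values is empty")
    best_index = max(range(len(coherence_values)), key=lambda i: coherence_values[i])
    return start + best_index * step, coherence_values[best_index]


# Illustrative scores only (rounded, not the full run above)
scores = [0.446, 0.469, 0.442, 0.453]
best_k, best_score = pick_num_topics(scores, start=10, step=1)
best_k_step2, _ = pick_num_topics(scores, start=10, step=2)
print(best_k, best_score, best_k_step2)
```

With `step=1` the result equals `start + index`; for any other step the index has to be multiplied by the step before it is added to `start`.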



In [None]:
# Displaying the topic frequency DataFrame
display(topic_freq)

Unnamed: 0,Topic,Frequency,Average_Sentiment
0,2,1239,0.767829
1,0,434,0.154679
2,4,371,0.384725
3,3,320,0.380366
4,10,304,0.275539
5,8,294,-0.212017
6,6,267,0.239964
7,9,207,0.139723
8,7,193,0.214502
9,1,187,0.206219


In [None]:
# Displaying the keywords dataframe
display(keyword_counts.head(50))

Unnamed: 0,Keyword,Frequency,Average_Sentiment
0,channel,1316,0.758766
1,game,1274,0.765821
2,play,1240,0.769537
3,player,1232,0.777392
4,token,250,0.563489
5,moxie,241,0.527506
6,fan,227,0.582245
7,farcaster,209,0.466484
8,battle,136,-0.699716
9,frame,127,0.422143
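
The `Average_Sentiment` column is computed by searching `clean_text` for each keyword, which matches substrings as well as whole tokens. The sketch below contrasts the two behaviours on made-up data. `mean_sentiment_for_keyword` is a hypothetical helper, and whole-word matching is an alternative to consider, not what the notebook currently does.

```python
import re


def mean_sentiment_for_keyword(texts, sentiments, keyword, whole_word=True):
    """Average the sentiment of the texts that mention `keyword`.

    whole_word=True matches the keyword only as a separate token;
    whole_word=False reproduces plain substring matching.
    Returns None when no text matches.
    """
    if whole_word:
        pattern = re.compile(rf"\b{re.escape(keyword)}\b")
        matched = [s for t, s in zip(texts, sentiments) if pattern.search(t)]
    else:
        matched = [s for t, s in zip(texts, sentiments) if keyword in t]
    return sum(matched) / len(matched) if matched else None


# Made-up cleaned texts and compound scores, for illustration only
texts = ["fan token bid", "fantasy league draft", "moxie fan reward"]
sentiments = [0.4, -0.6, 0.8]

substring_mean = mean_sentiment_for_keyword(texts, sentiments, "fan", whole_word=False)
token_mean = mean_sentiment_for_keyword(texts, sentiments, "fan")
print(substring_mean, token_mean)
```

In this toy example the substring version also averages in the "fantasy" text, so the two means differ. Short keywords are the most exposed to this effect.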


# **STEP 4: Recasts from TOP20 KOLs - Last 30 Days**

In [None]:
# Fetching the latest result of the query
query_id = 3962797  # recasts_KOLs_TOP20_30days
query_result = dune.get_latest_result(query_id)

# Accessing the rows attribute to get the data
result_data = None

# Checking if 'rows' contains data
if hasattr(query_result.result, 'rows'):
    result_data = query_result.result.rows
else:
    print("No 'rows' attribute found in the result.")

# Converting result_data to DataFrame
if result_data:
    # Checking if result_data is directly a list of dictionaries
    if isinstance(result_data, list) and len(result_data) > 0 and isinstance(result_data[0], dict):
        recasts_KOLTOP20_30days = pd.DataFrame(result_data)
    elif isinstance(result_data, dict) and 'data' in result_data:
        recasts_KOLTOP20_30days = pd.DataFrame(result_data['data'])
    else:
        print("Unsupported data format:", type(result_data))
else:
    print("No data available or unable to access data.")

# Displaying the DataFrame
if 'recasts_KOLTOP20_30days' in locals():
    print(recasts_KOLTOP20_30days.head())
else:
    print("Failed to create DataFrame.")

            created_at  days_since_creation embeds     fid  \
0  2024-08-02 05:34:47                    3     []  247143   
1  2024-08-06 11:58:30                    0     []  247143   
2  2024-08-06 11:59:25                    0     []  247143   
3  2024-08-06 11:41:35                    0     []     602   
4  2024-08-06 12:22:24                    0     []  247143   

                                         hash  parent_fid  \
0  0x09133b34ea0a2957536ff15ce47e633d03d64ffb        8446   
1  0xca1dda39f7d7788b42810c7e0ea58ec82c156e22      394023   
2  0x9e8b998f96658a9e00715d9542c268492c6c5dd4      418671   
3  0xc55fae00db1cba5b977b02852f0d10aca9c33e86      758919   
4  0x443a9a507318e92e5f4c256eae3f0e81035fc8e3        3652   

                                  parent_hash parent_url  \
0  0x549d0281ea9f5c9c8fc33090ce3d7a659d346fb1       None   
1  0x25ff49342686bd9f949fa2a2948d4807e30e66ff       None   
2  0x0a1cc4894f22ed23a06536e030eec8cbbf28630f       None   
3  0x86605f738e3d7bf
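
The fetch cell above checks for two payload shapes and prints a message for anything else. A possible refactoring is sketched below, assuming the same two shapes. `extract_rows` is a hypothetical helper and not part of `dune-client`. It raises instead of printing, so a failed fetch cannot silently leave a DataFrame from an earlier run in use.

```python
def extract_rows(result_data):
    """Normalize a query payload into a list of row dictionaries.

    Accepts either a list of dicts or a dict holding the rows under 'data',
    the two shapes the fetch cell checks for.
    """
    if isinstance(result_data, list):
        if all(isinstance(row, dict) for row in result_data):
            return result_data
        raise TypeError("expected every row to be a dict")
    if isinstance(result_data, dict) and "data" in result_data:
        return extract_rows(result_data["data"])
    raise TypeError(f"unsupported data format: {type(result_data).__name__}")


# Hypothetical payloads shaped like the rows printed above
rows = extract_rows([{"fid": 247143, "parent_fid": 8446}])
nested = extract_rows({"data": [{"fid": 602, "parent_fid": 758919}]})
print(len(rows), nested[0]["fid"])
```

The returned list can be passed straight to `pd.DataFrame(rows)`, as the cell above does with `result_data`.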

In [None]:
# Producing a basic description of the dataset and respective variables
print("\n\033[1m\033[30mColumns in the DataFrame:\033[0m\n")
print(recasts_KOLTOP20_30days.columns)
print("\n\033[1m\033[30mCases and columns in the DataFrame:\033[0m\n")
print(recasts_KOLTOP20_30days.shape)
print("\n\033[1m\033[30mColumn datatypes list:\033[0m\n")
recasts_KOLTOP20_30days.info()
print("\n\033[1m\033[30mMissing values:\033[0m\n")
print(recasts_KOLTOP20_30days.isna().sum())

# Printing main descriptive statistics
print("\n\033[1m\033[30mVariable main descriptives:\033[0m\n")
recasts_KOLTOP20_30days.describe()


Columns in the DataFrame:

Index(['created_at', 'days_since_creation', 'embeds', 'fid', 'hash',
       'parent_fid', 'parent_hash', 'parent_url', 'text'],
      dtype='object')

Cases and columns in the DataFrame:

(18429, 9)

Column datatypes list:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18429 entries, 0 to 18428
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   created_at           18429 non-null  object
 1   days_since_creation  18429 non-null  int64 
 2   embeds               18429 non-null  object
 3   fid                  18429 non-null  int64 
 4   hash                 18429 non-null  object
 5   parent_fid           18429 non-null  int64 
 6   parent_hash          18429 non-null  object
 7   parent_url           0 non-null      object
 8   text                 18429 non-null  object
dtypes: int64(3), object(6)
memory usage: 1.3+ MB

Mi

Unnamed: 0,days_since_creation,fid,parent_fid
count,18429.0,18429.0,18429.0
mean,16.382332,322912.00076,404377.823593
std,8.652939,240706.709858,228703.883088
min,0.0,3.0,2.0
25%,10.0,247143.0,254128.0
50%,17.0,247143.0,434268.0
75%,24.0,642133.0,653407.0
max,29.0,784003.0,816799.0


In [None]:
# Building the stop-word set and the lemmatizer once, instead of on every call
stop_words = set(stopwords.words('english'))
stop_words.update(custom_stop_words)  # Using the already defined custom stop words
wordnet_lemmatizer = WordNetLemmatizer()

# Preprocessing the text
def preprocess_text(text):
    # Removing URLs
    text = re.sub(r'http\S+', '', text)

    # Tokenizing, lowercasing, and keeping alphanumeric tokens that are neither stop words nor pure numbers
    words = word_tokenize(text.lower())
    words = [word for word in words if word.isalnum() and word not in stop_words and not word.isdigit()]

    # Lemmatizing words
    words = [wordnet_lemmatizer.lemmatize(word) for word in words]

    return ' '.join(words)

# Assuming recasts_KOLTOP20_30days DataFrame is already loaded
recasts_KOLTOP20_30days['clean_text'] = recasts_KOLTOP20_30days['text'].apply(preprocess_text)

# Extracting Keywords using TF-IDF
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(recasts_KOLTOP20_30days['clean_text'])
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()

keywords = []
for row in tfidf_matrix:
    keywords.append([tfidf_feature_names[i] for i in row.nonzero()[1]])

recasts_KOLTOP20_30days['keywords'] = keywords

# Converting documents to a corpus of word lists
texts_recasts = [text.split() for text in recasts_KOLTOP20_30days['clean_text']]
dictionary_recasts = corpora.Dictionary(texts_recasts)
corpus_recasts = [dictionary_recasts.doc2bow(text) for text in texts_recasts]

# Determining the optimal number of topics using coherence scores
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=1):
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = gensim.models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, passes=10, random_state=42)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())
    return model_list, coherence_values

limit = 30  # maximum number of topics
start = 10   # minimum number of topics
step = 1    # step size

model_list_recasts, coherence_values_recasts = compute_coherence_values(dictionary_recasts, corpus_recasts, texts_recasts, limit, start, step)

# Selecting the model with the highest coherence score
optimal_model_recasts = model_list_recasts[coherence_values_recasts.index(max(coherence_values_recasts))]

# Displaying coherence scores
for m, cv in zip(range(start, limit, step), coherence_values_recasts):
    print(f"Num Topics = {m}, Coherence Value = {cv}")

# Saving optimal number of topics (the index is scaled by step so this also holds when step != 1)
optimal_num_topics_recasts = start + step * coherence_values_recasts.index(max(coherence_values_recasts))
print(f"Optimal number of topics: {optimal_num_topics_recasts}")

# Reusing the model already trained for the optimal number of topics
# (same corpus, parameters, and random_state, so retraining would only repeat the work)
lda_model_recasts = optimal_model_recasts
topics_recasts = lda_model_recasts.print_topics(num_words=5)

# Assigning topics to documents
def get_document_topics(text):
    bow = dictionary_recasts.doc2bow(text.split())
    document_topics = lda_model_recasts.get_document_topics(bow)
    return document_topics

recasts_KOLTOP20_30days['topics'] = recasts_KOLTOP20_30days['clean_text'].apply(get_document_topics)

# Extracting the dominant topic for each document
recasts_KOLTOP20_30days['dominant_topic'] = recasts_KOLTOP20_30days['topics'].apply(get_dominant_topic)

# Calculating the frequency of each topic
topic_freq_recasts = recasts_KOLTOP20_30days['dominant_topic'].value_counts().reset_index()
topic_freq_recasts.columns = ['Topic', 'Frequency']

# Performing Sentiment Analysis
sid = SentimentIntensityAnalyzer()
recasts_KOLTOP20_30days['sentiment'] = recasts_KOLTOP20_30days['text'].apply(lambda x: sid.polarity_scores(x)['compound'])

# Keeping only the analysis columns (all other columns are dropped from this DataFrame)
recasts_KOLTOP20_30days = recasts_KOLTOP20_30days[['text', 'clean_text', 'keywords', 'topics', 'dominant_topic', 'sentiment']]

# Calculating average sentiment for each topic
topic_sentiment_recasts = recasts_KOLTOP20_30days.groupby('dominant_topic')['sentiment'].mean().reset_index()
topic_sentiment_recasts.columns = ['Topic', 'Average_Sentiment']

# Merging with topic frequency
topic_freq_recasts = pd.merge(topic_freq_recasts, topic_sentiment_recasts, on='Topic')

# Displaying the resulting DataFrame
display(recasts_KOLTOP20_30days.head(10))

# Performing Keyword Analysis
# Flattening the list of keywords and counting the frequencies
all_keywords_recasts = [keyword for sublist in recasts_KOLTOP20_30days['keywords'] for keyword in sublist]
keyword_counts_recasts = pd.Series(all_keywords_recasts).value_counts().reset_index()
keyword_counts_recasts.columns = ['Keyword', 'Frequency']
keyword_counts_recasts = keyword_counts_recasts.sort_values(by='Frequency', ascending=False)

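# Calculating average sentiment for each keyword
# Note: this is a substring match on clean_text, so 'fan' also matches 'fantasy';
# regex=False treats the keyword as literal text rather than as a regular expression
keyword_sentiment_list_recasts = []
for keyword in keyword_counts_recasts['Keyword']:
    avg_sentiment = recasts_KOLTOP20_30days[recasts_KOLTOP20_30days['clean_text'].str.contains(keyword, regex=False)]['sentiment'].mean()
    keyword_sentiment_list_recasts.append(avg_sentiment)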

keyword_counts_recasts['Average_Sentiment'] = keyword_sentiment_list_recasts

# Saving keyword analysis and topic modeling results to CSV
keyword_counts_recasts.to_csv('recasts_keyword_analysis.csv', index=False)
topic_freq_recasts.to_csv('recasts_topic_modeling_results.csv', index=False)

# Displaying the topic frequencies
display(topic_freq_recasts)

# Displaying top keywords
display(keyword_counts_recasts.head(50))

# Downloading the CSV files
from google.colab import files
files.download('recasts_keyword_analysis.csv')
files.download('recasts_topic_modeling_results.csv')


Num Topics = 10, Coherence Value = 0.5707290534534899
Num Topics = 11, Coherence Value = 0.5629419726022286
Num Topics = 12, Coherence Value = 0.5292021497367512
Num Topics = 13, Coherence Value = 0.5438023513389033
Num Topics = 14, Coherence Value = 0.5333377791702534
Num Topics = 15, Coherence Value = 0.5414552393755794
Num Topics = 16, Coherence Value = 0.5418690419071376
Num Topics = 17, Coherence Value = 0.5597640008931348
Num Topics = 18, Coherence Value = 0.5627286264028141
Num Topics = 19, Coherence Value = 0.5324760418370857
Num Topics = 20, Coherence Value = 0.5697254486079693
Num Topics = 21, Coherence Value = 0.555529358403432
Num Topics = 22, Coherence Value = 0.5416557082842275
Num Topics = 23, Coherence Value = 0.553628886182025
Num Topics = 24, Coherence Value = 0.5285945635981711
Num Topics = 25, Coherence Value = 0.5440008211992358
Num Topics = 26, Coherence Value = 0.5315532263058409
Num Topics = 27, Coherence Value = 0.5355586288048415
Num Topics = 28, Coherence Val

Unnamed: 0,text,clean_text,keywords,topics,dominant_topic,sentiment
0,anything for us who viewed the livestream?? 🤣,u viewed livestream,"[livestream, viewed]","[(0, 0.042706795), (1, 0.04270698), (2, 0.0427...",4,0.0
1,"Not sure, probably owners of nouns",owner noun,"[noun, owner]","[(0, 0.36656097), (1, 0.03334871), (2, 0.03334...",5,-0.2411
2,whats the critera there?\n\n678 $DEGEN,critera degen,"[degen, critera]","[(0, 0.04918877), (1, 0.04918892), (2, 0.04918...",8,0.0
3,Thanks for feedback!\n\nThe system is working ...,feedback system working intended bot authority...,"[fall, immediately, overrun, engagement, total...","[(3, 0.040195893), (5, 0.82739955), (7, 0.1097...",5,0.902
4,lets go!,go,[go],"[(0, 0.050000113), (1, 0.050000113), (2, 0.050...",4,0.0
5,mr.japan when he sees this cast,see,[see],"[(0, 0.050088435), (1, 0.050088435), (2, 0.050...",4,0.0
6,thanks jacek for the 100$ (soon),jacek soon,"[soon, jacek]","[(0, 0.03335287), (1, 0.03335287), (2, 0.03335...",5,0.4404
7,Let's get on an audio space or unlonely and di...,audio space unlonely discus debate cc,"[cc, debate, discus, unlonely, space, audio]","[(0, 0.014303363), (1, 0.014303363), (2, 0.014...",8,0.0
8,"looks good, my fav combi!! But there should be...",look fav combi mustard sauce,"[sauce, mustard, combi, fav, look]","[(0, 0.018991807), (1, 0.018991822), (2, 0.018...",6,0.5894
9,Anyone can build a faster bidding experience o...,build faster bidding experience outside frame ...,"[onchain, open, code, frame, outside, experien...","[(0, 0.010002427), (1, 0.10985512), (2, 0.0100...",9,0.0


Unnamed: 0,Topic,Frequency,Average_Sentiment
0,0,3489,0.074945
1,2,3252,0.574939
2,8,2530,0.183463
3,6,1786,0.172333
4,1,1725,0.123084
5,4,1631,0.065668
6,3,1202,0.11996
7,5,1121,0.184763
8,7,971,0.297931
9,9,722,0.172164


Unnamed: 0,Keyword,Frequency,Average_Sentiment
0,degen,2503,0.133011
1,channel,1939,0.733227
2,game,1884,0.009555
3,result,1854,0.002792
4,momentarily,1845,0.0
5,play,1839,0.748029
6,match,1827,0.75469
7,dudegen,1499,0.188443
8,tip,1312,0.367999
9,farther,1199,0.039443



In [None]:
# Displaying the topic frequency DataFrame for recasts
display(topic_freq_recasts)

Unnamed: 0,Topic,Frequency,Average_Sentiment
0,0,3489,0.074945
1,2,3252,0.574939
2,8,2530,0.183463
3,6,1786,0.172333
4,1,1725,0.123084
5,4,1631,0.065668
6,3,1202,0.11996
7,5,1121,0.184763
8,7,971,0.297931
9,9,722,0.172164
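
The dominant topic is picked with `max`, which returns the first entry when several topics share the highest probability. In the Step 3 output, the cast whose text is only a URL ends up with an empty `clean_text`, a uniform topic distribution, and dominant topic 0. A minimal sketch of one way to flag such rows instead of counting them follows. `dominant_topic_or_none` and its `tolerance` parameter are hypothetical, not part of gensim.

```python
def dominant_topic_or_none(topic_probs, tolerance=1e-6):
    """Return the most probable topic id, or None when there is no clear winner.

    topic_probs is a list of (topic_id, probability) pairs, the shape stored
    in the 'topics' column.
    """
    if not topic_probs:
        return None
    ranked = sorted(topic_probs, key=lambda pair: pair[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] <= tolerance:
        return None  # tie: the text carries no usable topic signal
    return ranked[0][0]


# Values modeled on rows shown earlier: one uniform row, one clear winner
uniform = [(i, 0.09090909) for i in range(11)]
clear = [(1, 0.48585856), (7, 0.4595752)]
print(dominant_topic_or_none(uniform), dominant_topic_or_none(clear))
```

Rows that come back as `None` could be excluded before computing topic frequencies, so that casts with no usable text do not inflate one topic.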


In [None]:
# Displaying the keywords dataframe for recasts
display(keyword_counts_recasts.head(50))

Unnamed: 0,Keyword,Frequency,Average_Sentiment
0,degen,2503,0.133011
1,channel,1939,0.733227
2,game,1884,0.009555
3,result,1854,0.002792
4,momentarily,1845,0.0
5,play,1839,0.748029
6,match,1827,0.75469
7,dudegen,1499,0.188443
8,tip,1312,0.367999
9,farther,1199,0.039443


# **STEP 5: Casts From TOP20 KOLs - Last 180 Days**

In [None]:
# Fetching the latest result of the query
query_id = 3973614  # casts_KOLs_TOP20_180 to 91 days
query_result = dune.get_latest_result(query_id)

# Accessing the rows attribute to get the data
result_data = None

# Checking if 'rows' contains data
if hasattr(query_result.result, 'rows'):
    result_data = query_result.result.rows
else:
    print("No 'rows' attribute found in the result.")

# Converting result_data to DataFrame
if result_data:
    # Checking if result_data is directly a list of dictionaries
    if isinstance(result_data, list) and len(result_data) > 0 and isinstance(result_data[0], dict):
        casts_KOLTOP20_180_to_91days = pd.DataFrame(result_data)
    elif isinstance(result_data, dict) and 'data' in result_data:
        casts_KOLTOP20_180_to_91days = pd.DataFrame(result_data['data'])
    else:
        print("Unsupported data format:", type(result_data))
else:
    print("No data available or unable to access data.")

# Displaying the DataFrame
if 'casts_KOLTOP20_180_to_91days' in locals():
    print(casts_KOLTOP20_180_to_91days.head())
else:
    print("Failed to create DataFrame.")



            created_at  days_since_creation  \
0  2024-05-08 05:28:11                   91   
1  2024-03-15 09:27:46                  145   
2  2024-03-16 05:20:58                  144   
3  2024-03-16 13:25:54                  144   
4  2024-03-16 13:27:37                  144   

                                              embeds     fid  \
0                                                 []  247143   
1  [{"url": "https://drakula.app/post/2a50a7e5-08...   18471   
2                                                 []       3   
3  [{"url": "https://stream.warpcast.com/v1/video...  281836   
4  [{"url": "https://hyperloot-dungeon-war-result...  247143   

                                         hash  month parent_fid parent_hash  \
0  0x37c9d327049d9842d78231a37f25d89cb33e4b00      5       None        None   
1  0x3f8d4eaf1643bddd1dd0db3fb6406ae04143e658      3       None        None   
2  0x6fba7c05004654311992a7f32a3de6fa7dcde20b      3       None        None   
3  0x73f926b3d71

In [None]:
# Producing a basic description of the dataset and respective variables
print("\n\033[1m\033[30mColumns in the DataFrame:\033[0m\n")
print(casts_KOLTOP20_180_to_91days.columns)
print("\n\033[1m\033[30mCases and columns in the DataFrame:\033[0m\n")
print(casts_KOLTOP20_180_to_91days.shape)
print("\n\033[1m\033[30mColumn datatypes list:\033[0m\n")
casts_KOLTOP20_180_to_91days.info()
print("\n\033[1m\033[30mMissing values:\033[0m\n")
print(casts_KOLTOP20_180_to_91days.isna().sum())

# Printing main descriptive statistics
print("\n\033[1m\033[30mVariable main descriptives:\033[0m\n")
casts_KOLTOP20_180_to_91days.describe()


Columns in the DataFrame:

Index(['created_at', 'days_since_creation', 'embeds', 'fid', 'hash', 'month',
       'parent_fid', 'parent_hash', 'parent_url', 'text', 'year'],
      dtype='object')

Cases and columns in the DataFrame:

(4913, 11)

Column datatypes list:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4913 entries, 0 to 4912
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   created_at           4913 non-null   object
 1   days_since_creation  4913 non-null   int64 
 2   embeds               4913 non-null   object
 3   fid                  4913 non-null   int64 
 4   hash                 4913 non-null   object
 5   month                4913 non-null   int64 
 6   parent_fid           0 non-null      object
 7   parent_hash          0 non-null      object
 8   parent_url           3017 non-null   object
 9   text                 4913 non-null   obje

Unnamed: 0,days_since_creation,fid,month,year
count,4913.0,4913.0,4913.0,4913.0
mean,130.233666,110094.68105,3.454916,2024.0
std,25.318741,128988.559741,0.919419,0.0
min,91.0,3.0,2.0,2024.0
25%,109.0,99.0,3.0,2024.0
50%,126.0,18471.0,4.0,2024.0
75%,153.0,247143.0,4.0,2024.0
max,179.0,281836.0,5.0,2024.0


In [None]:
# Fetching the latest result of the query for casts_KOLTOP20_90_to_31days
query_id_90_to_31days = 3964062  # casts_KOLs_TOP20_90 to 31 days
query_result_90_to_31days = dune.get_latest_result(query_id_90_to_31days)

# Accessing the rows attribute to get the data for casts_KOLTOP20_90_to_31days
result_data_90_to_31days = None

# Checking if 'rows' contains data
if hasattr(query_result_90_to_31days.result, 'rows'):
    result_data_90_to_31days = query_result_90_to_31days.result.rows
else:
    print("No 'rows' attribute found in the result.")

# Converting result_data to DataFrame
if result_data_90_to_31days:
    # Checking if result_data is directly a list of dictionaries
    if isinstance(result_data_90_to_31days, list) and len(result_data_90_to_31days) > 0 and isinstance(result_data_90_to_31days[0], dict):
        casts_KOLTOP20_90_to_31days = pd.DataFrame(result_data_90_to_31days)
    elif isinstance(result_data_90_to_31days, dict) and 'data' in result_data_90_to_31days:
        casts_KOLTOP20_90_to_31days = pd.DataFrame(result_data_90_to_31days['data'])
    else:
        print("Unsupported data format:", type(result_data_90_to_31days))
else:
    print("No data available or unable to access data.")

# Displaying the DataFrame
if 'casts_KOLTOP20_90_to_31days' in locals():
    print(casts_KOLTOP20_90_to_31days.head())
else:
    print("Failed to create DataFrame.")

# Checking if all dataframes are available
if 'casts_KOLTOP20_30days' in locals() and 'casts_KOLTOP20_90_to_31days' in locals() and 'casts_KOLTOP20_180_to_91days' in locals():
    casts_KOLTOP20_180days = pd.concat([casts_KOLTOP20_30days, casts_KOLTOP20_90_to_31days, casts_KOLTOP20_180_to_91days], ignore_index=True)
    print(casts_KOLTOP20_180days.head())
else:
    print("One or more of the dataframes are not available.")

            created_at  days_since_creation  \
0  2024-06-09 03:00:14                   59   
1  2024-06-09 09:28:48                   59   
2  2024-06-09 15:54:20                   59   
3  2024-06-09 16:02:14                   59   
4  2024-06-09 18:16:55                   59   

                                              embeds     fid  \
0                                                 []  247143   
1  [{"castId": {"fid": 491391, "hash": {"data": [...  562300   
2  [{"url": "https://imagedelivery.net/BXluQx4ige...  617632   
3  [{"url": "https://perl.xyz/market/235/play-ver...    1110   
4  [{"url": "https://imagedelivery.net/BXluQx4ige...  618823   

                                         hash  month parent_fid parent_hash  \
0  0x640ef21e3337a085dad5a3199b4589dfe33521d4      6       None        None   
1  0xff03e6cde14b2b957d0ea6f73e7d458172c16cf3      6       None        None   
2  0x31c5c875c9ffcce6b962e7eca1f7be499a0b17f4      6       None        None   
3  0xe6015522f61
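
The cell above accepts two shapes for the fetched rows: a list of dictionaries, or a dictionary that wraps the rows under a `'data'` key. A minimal sketch of that normalization as a reusable helper follows. The name `extract_rows` is hypothetical and is not part of `dune-client`; the helper only reorganizes the checks already present in the cell.

```python
def extract_rows(result_data):
    """Return a list of row dictionaries from either supported shape, else an empty list."""
    # Shape 1: already a non-empty list of dictionaries
    if isinstance(result_data, list) and result_data and isinstance(result_data[0], dict):
        return result_data
    # Shape 2: a dictionary wrapping the rows under a 'data' key
    if isinstance(result_data, dict) and 'data' in result_data:
        return list(result_data['data'])
    # Anything else (None, an empty list, an unexpected type)
    return []

# Toy inputs (hypothetical values, not taken from the query results)
rows_a = extract_rows([{'fid': 3, 'month': 6}])
rows_b = extract_rows({'data': [{'fid': 99, 'month': 6}]})
rows_c = extract_rows(None)
print(rows_a, rows_b, rows_c)
```

The returned list can be passed straight to `pd.DataFrame(...)`, so the same helper could serve each of the three query windows.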

In [None]:
# Producing a basic description of the dataset and respective variables
print("\n\033[1m\033[30mColumns in the DataFrame:\033[0m\n")
print(casts_KOLTOP20_180days.columns)
print("\n\033[1m\033[30mCases and columns in the DataFrame:\033[0m\n")
print(casts_KOLTOP20_180days.shape)
print("\n\033[1m\033[30mColumn datatypes list:\033[0m\n")
casts_KOLTOP20_180days.info()
print("\n\033[1m\033[30mMissing values:\033[0m\n")
print(casts_KOLTOP20_180days.isna().sum())

# Printing main descriptive statistics
print("\n\033[1m\033[30mVariable main descriptives:\033[0m\n")
casts_KOLTOP20_180days.describe()


[1m[30mColumns in the DataFrame:[0m

Index(['created_at', 'days_since_creation', 'embeds', 'fid', 'hash', 'month',
       'parent_fid', 'parent_hash', 'parent_url', 'text', 'year', 'clean_text',
       'keywords', 'topics', 'dominant_topic', 'sentiment'],
      dtype='object')

[1m[30mCases and columns in the DataFrame:[0m

(31164, 16)

[1m[30mColumn datatypes list:[0m

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31164 entries, 0 to 31163
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   created_at           31164 non-null  object 
 1   days_since_creation  31164 non-null  int64  
 2   embeds               31164 non-null  object 
 3   fid                  31164 non-null  int64  
 4   hash                 31164 non-null  object 
 5   month                31164 non-null  int64  
 6   parent_fid           0 non-null      object 
 7   parent_hash          0 non-null      object 
 8   par

statistic,days_since_creation,fid,month,year,dominant_topic,sentiment
count,31164.0,31164.0,31164.0,31164.0,3944.0,3944.0
mean,52.762194,479828.936048,5.993005,2024.0,4.045892,0.380277
std,37.544188,266980.566932,1.301739,0.0,3.090104,0.466851
min,0.0,3.0,2.0,2024.0,0.0,-0.9935
25%,34.0,247143.0,6.0,2024.0,2.0,0.0
50%,39.0,657052.0,6.0,2024.0,3.0,0.5267
75%,52.0,657052.0,7.0,2024.0,7.0,0.7772
max,179.0,784003.0,8.0,2024.0,10.0,0.9936


In [None]:
# Preprocessing the text
# Building the stop-word set and the lemmatizer once, rather than on every call
stop_words = set(stopwords.words('english'))
stop_words.update(custom_stop_words)  # Using the already defined custom stop words
wordnet_lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Removing URLs
    text = re.sub(r'http\S+', '', text)

    # Tokenizing and removing stop words, punctuation and pure numbers
    words = word_tokenize(text.lower())
    words = [word for word in words if word.isalnum() and word not in stop_words and not word.isdigit()]

    # Lemmatizing words
    words = [wordnet_lemmatizer.lemmatize(word) for word in words]

    return ' '.join(words)

# Preprocessing text for casts_KOLTOP20_180days
casts_KOLTOP20_180days['clean_text'] = casts_KOLTOP20_180days['text'].apply(preprocess_text)

# Extracting Keywords using TF-IDF
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(casts_KOLTOP20_180days['clean_text'])
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()

keywords = []
for row in tfidf_matrix:
    keywords.append([tfidf_feature_names[i] for i in row.nonzero()[1]])

casts_KOLTOP20_180days['keywords'] = keywords

# Converting documents to a corpus of word lists
texts_casts = [text.split() for text in casts_KOLTOP20_180days['clean_text']]
dictionary_casts = corpora.Dictionary(texts_casts)
corpus_casts = [dictionary_casts.doc2bow(text) for text in texts_casts]

# Defining parameters for topic model optimization
limit = 30  # maximum number of topics
start = 10   # minimum number of topics
step = 1    # step size

# Determining the optimal number of topics using coherence scores
model_list_casts, coherence_values_casts = compute_coherence_values(dictionary_casts, corpus_casts, texts_casts, limit, start, step)

# Selecting the model with the highest coherence score
optimal_model_casts = model_list_casts[coherence_values_casts.index(max(coherence_values_casts))]

# Displaying coherence scores
for m, cv in zip(range(start, limit, step), coherence_values_casts):
    print(f"Num Topics = {m}, Coherence Value = {cv}")

# Saving optimal number of topics (scaling the index by the step so this also holds when step != 1)
optimal_num_topics_casts = start + coherence_values_casts.index(max(coherence_values_casts)) * step
print(f"Optimal number of topics for casts_KOLTOP20_180days: {optimal_num_topics_casts}")

# Applying LDA with the optimal number of topics
lda_model_casts = gensim.models.LdaModel(corpus_casts, num_topics=optimal_num_topics_casts, id2word=dictionary_casts, passes=10, random_state=42)
topics_casts = lda_model_casts.print_topics(num_words=5)

# Assigning topics to documents
def get_document_topics(text):
    bow = dictionary_casts.doc2bow(text.split())
    document_topics = lda_model_casts.get_document_topics(bow)
    return document_topics

casts_KOLTOP20_180days['topics'] = casts_KOLTOP20_180days['clean_text'].apply(lambda text: get_document_topics(text))

# Extracting the dominant topic for each document
casts_KOLTOP20_180days['dominant_topic'] = casts_KOLTOP20_180days['topics'].apply(get_dominant_topic)

# Calculating the frequency of each topic
topic_freq_casts = casts_KOLTOP20_180days['dominant_topic'].value_counts().reset_index()
topic_freq_casts.columns = ['Topic', 'Frequency']

# Performing Sentiment Analysis
sid = SentimentIntensityAnalyzer()
casts_KOLTOP20_180days['sentiment'] = casts_KOLTOP20_180days['text'].apply(lambda x: sid.polarity_scores(x)['compound'])

# Reordering columns
casts_KOLTOP20_180days = casts_KOLTOP20_180days[['text', 'clean_text', 'keywords', 'topics', 'dominant_topic', 'sentiment', 'month', 'year']]

# Calculating average sentiment for each topic
topic_sentiment_casts = casts_KOLTOP20_180days.groupby('dominant_topic')['sentiment'].mean().reset_index()
topic_sentiment_casts.columns = ['Topic', 'Average_Sentiment']

# Merging with topic frequency
topic_freq_casts = pd.merge(topic_freq_casts, topic_sentiment_casts, on='Topic')

# Displaying the resulting DataFrame
display(casts_KOLTOP20_180days.head(10))

# Performing Keyword Analysis
# Flattening the list of keywords and counting the frequencies
all_keywords_casts = [keyword for sublist in casts_KOLTOP20_180days['keywords'] for keyword in sublist]
keyword_counts_casts = pd.Series(all_keywords_casts).value_counts().reset_index()
keyword_counts_casts.columns = ['Keyword', 'Frequency']
keyword_counts_casts = keyword_counts_casts.sort_values(by='Frequency', ascending=False)

# Calculating average sentiment for each keyword
# Note: str.contains matches substrings, so a keyword such as 'eth' also matches 'ethereum'
keyword_sentiment_list_casts = []
for keyword in keyword_counts_casts['Keyword']:
    avg_sentiment = casts_KOLTOP20_180days[casts_KOLTOP20_180days['clean_text'].str.contains(keyword, regex=False)]['sentiment'].mean()
    keyword_sentiment_list_casts.append(avg_sentiment)

keyword_counts_casts['Average_Sentiment'] = keyword_sentiment_list_casts

# Saving keyword analysis and topic modeling results to CSV
keyword_counts_casts.to_csv('casts_keyword_analysis_180days.csv', index=False)
topic_freq_casts.to_csv('casts_topic_modeling_results_180days.csv', index=False)

# Displaying the topic frequencies
display(topic_freq_casts)

# Displaying top keywords
display(keyword_counts_casts.head(50))

# Downloading the CSV files
from google.colab import files
files.download('casts_keyword_analysis_180days.csv')
files.download('casts_topic_modeling_results_180days.csv')

Num Topics = 10, Coherence Value = 0.4421975075559432
Num Topics = 11, Coherence Value = 0.47430658947694043
Num Topics = 12, Coherence Value = 0.46464277838844636
Num Topics = 13, Coherence Value = 0.3999446687848019
Num Topics = 14, Coherence Value = 0.43902747966512434
Num Topics = 15, Coherence Value = 0.45441825381176654
Num Topics = 16, Coherence Value = 0.42363551944017086
Num Topics = 17, Coherence Value = 0.45385062496806144
Num Topics = 18, Coherence Value = 0.44218296433640575
Num Topics = 19, Coherence Value = 0.42505554436847987
Num Topics = 20, Coherence Value = 0.4487014440893143
Num Topics = 21, Coherence Value = 0.4322114831092442
Num Topics = 22, Coherence Value = 0.4397109085277757
Num Topics = 23, Coherence Value = 0.4291684409047429
Num Topics = 24, Coherence Value = 0.4353422819335675
Num Topics = 25, Coherence Value = 0.4436779541736983
Num Topics = 26, Coherence Value = 0.42727608020585767
Num Topics = 27, Coherence Value = 0.4255880942694127
Num Topics = 28, Co
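
The selection step maps the position of the highest coherence score back to a topic count. A minimal sketch of that mapping is shown here, using only the first four coherence values printed above. `best_num_topics` is a hypothetical helper name, and the `step=2` grid is an illustrative assumption that reuses the same scores.

```python
def best_num_topics(coherence_values, start, step):
    """Map the index of the highest coherence score back to its topic count."""
    best_index = max(range(len(coherence_values)), key=coherence_values.__getitem__)
    return start + best_index * step

# First four coherence values from the output above (10 to 13 topics)
scores = [0.4421975075559432, 0.47430658947694043, 0.46464277838844636, 0.3999446687848019]

# With step=1 the candidates are 10, 11, 12, 13
best_step1 = best_num_topics(scores, start=10, step=1)

# Hypothetical grid with step=2 (candidates 10, 12, 14, 16), reusing the same scores
best_step2 = best_num_topics(scores, start=10, step=2)

print(best_step1, best_step2)
```

Multiplying the index by `step` matters only when the grid is not contiguous; with `step = 1`, as in this notebook, the result is the same as `start + index`.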

index,text,clean_text,keywords,topics,dominant_topic,sentiment,month,year
0,Social media user reaction to this story in Pa...,medium reaction story pakistan extremely onlin...,"[true, pakistani, online, extremely, pakistan,...","[(0, 0.010113895), (1, 0.6765779), (2, 0.01011...",1,0.7906,8,2024
1,yall know what time it is\n\ngather around lem...,yall gather lem tell yall story,"[tell, lem, gather, yall, story]","[(0, 0.18151367), (1, 0.015122856), (2, 0.0151...",5,0.0,8,2024
2,https://yo-dudes.vercel.app/api,,[],"[(0, 0.09090909), (1, 0.09090909), (2, 0.09090...",0,0.0,8,2024
3,Have you checked the $hunt-tip ranks today?\n\...,checked rank top favorite builder list clappin...,"[tipping, clapping, list, builder, favorite, t...","[(0, 0.1007799), (6, 0.09513068), (7, 0.738004...",7,0.6239,8,2024
4,"#18,000 NFT that created in Mint Club 🥳 🎉 \n\n...",nft mint club,"[club, mint, nft]","[(0, 0.022727573), (1, 0.022727573), (2, 0.022...",7,0.25,8,2024
5,"According to CoinGecko, the circulating supply...",according coingecko circulating supply degen b...,"[possible, growth, plenty, million, cap, marke...","[(0, 0.3104063), (1, 0.175083), (7, 0.29574928...",0,0.3818,8,2024
6,"We're hammering, coding, and putting the final...",hammering coding putting final touch soon able...,"[chain, across, asset, purchase, able, soon, t...","[(0, 0.1698794), (2, 0.25603107), (6, 0.083920...",8,0.2481,8,2024
7,has challenged to a battle!\nComplete the ba...,battle battle,[battle],"[(0, 0.030303065), (1, 0.030303065), (2, 0.030...",5,-0.7339,8,2024
8,dudes fan tokens going brrrrrr,dude fan token brrrrrr,"[brrrrrr, token, fan, dude]","[(0, 0.022694035), (1, 0.022694036), (2, 0.022...",7,0.3182,8,2024
9,I just bid for 's Fan Tokens powered by cc,bid fan token powered cc,"[cc, powered, bid, token, fan]","[(0, 0.015152934), (1, 0.015152934), (2, 0.015...",7,0.3182,8,2024
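
The `dominant_topic` column is derived from the `topics` column by `get_dominant_topic`, which is defined elsewhere in the notebook. The sketch below is an assumption about its behaviour (pick the topic with the highest probability), not a copy of it. It uses only the probability pairs that are fully visible in rows 0 and 2 of the table above.

```python
def dominant_topic(topic_probs):
    """Return the topic id with the highest probability, or None for an empty list."""
    if not topic_probs:
        return None
    # max() keeps the first maximal element, so ties resolve to the lowest-listed topic
    return max(topic_probs, key=lambda pair: pair[1])[0]

# Fully visible pairs from row 0 of the table above
row0 = dominant_topic([(0, 0.010113895), (1, 0.6765779)])

# Fully visible pairs from row 2, whose clean_text is empty and whose topic weights are uniform
row2 = dominant_topic([(0, 0.09090909), (1, 0.09090909)])

print(row0, row2)
```

Under this assumption, casts whose cleaned text is empty (for example a cast that is only a URL) receive a uniform distribution and fall into topic 0 by tie-breaking, which may inflate that topic's frequency.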


index,Topic,Frequency,Average_Sentiment
0,4,18762,0.77453
1,0,2335,0.209942
2,9,1746,0.33887
3,8,1700,0.2595
4,7,1338,0.344999
5,5,1012,-0.338608
6,1,1008,0.177349
7,6,978,0.197444
8,2,978,0.249079
9,10,703,0.183201


index,Keyword,Frequency,Average_Sentiment
0,channel,19162,0.76633
1,game,18905,0.772457
2,play,18823,0.774206
3,player,18744,0.776596
4,degen,1016,0.393263
5,farcaster,953,0.379684
6,battle,606,-0.663538
7,frame,577,0.376406
8,onchain,496,0.383972
9,mint,471,0.375876



In [None]:
# Displaying the topic frequency DataFrame for casts - 180 days
display(topic_freq_casts)

index,Topic,Frequency,Average_Sentiment
0,4,18762,0.77453
1,0,2335,0.209942
2,9,1746,0.33887
3,8,1700,0.2595
4,7,1338,0.344999
5,5,1012,-0.338608
6,1,1008,0.177349
7,6,978,0.197444
8,2,978,0.249079
9,10,703,0.183201


In [None]:
# Displaying the keywords dataframe for casts - 180days
display(keyword_counts_casts.head(50))

index,Keyword,Frequency,Average_Sentiment
0,channel,19162,0.76633
1,game,18905,0.772457
2,play,18823,0.774206
3,player,18744,0.776596
4,degen,1016,0.393263
5,farcaster,953,0.379684
6,battle,606,-0.663538
7,frame,577,0.376406
8,onchain,496,0.383972
9,mint,471,0.375876
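
The `Frequency` column counts casts in which the keyword appears as a token, while `Average_Sentiment` is computed over casts whose cleaned text contains the keyword as a substring, so the two columns can cover different sets of casts. A minimal sketch of the difference follows, using toy data that is hypothetical and not taken from the dataset. `average_sentiment` is an illustrative helper name.

```python
from statistics import mean

def average_sentiment(docs, keyword, exact=True):
    """docs: list of (clean_text, sentiment) pairs. Return the mean sentiment of matching docs."""
    if exact:
        # Whole-token match: the keyword must be one of the space-separated tokens
        matched = [score for text, score in docs if keyword in text.split()]
    else:
        # Substring match: the keyword may sit inside a longer word
        matched = [score for text, score in docs if keyword in text]
    return mean(matched) if matched else None

# Toy data (hypothetical)
docs = [
    ("eth price rising", 0.5),
    ("ethereum upgrade shipped", 0.1),
    ("put it together", -0.3),
]

substring_avg = average_sentiment(docs, "eth", exact=False)  # matches all three texts
exact_avg = average_sentiment(docs, "eth", exact=True)       # matches only the first text
print(substring_avg, exact_avg)
```

For long, distinctive keywords the two variants should rarely differ; short keywords are where the substring match is most likely to pull in unrelated casts.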


In [None]:
# Ensuring the 'month' and 'year' columns are in integer format
casts_KOLTOP20_180days['month'] = casts_KOLTOP20_180days['month'].astype(int)
casts_KOLTOP20_180days['year'] = casts_KOLTOP20_180days['year'].astype(int)

# Creating 'month_year' period column
casts_KOLTOP20_180days['month_year'] = pd.to_datetime(casts_KOLTOP20_180days[['year', 'month']].assign(day=1))

# Filtering for the last 6 months
last_six_months = casts_KOLTOP20_180days['month_year'].max() - pd.DateOffset(months=6)
casts_KOLTOP20_180days_last_6_months = casts_KOLTOP20_180days[casts_KOLTOP20_180days['month_year'] > last_six_months]

# Getting the top 200 keywords
top_200_keywords = keyword_counts_casts.head(200)['Keyword'].tolist()

# Initializing a dictionary to store keyword frequencies by month
keyword_frequencies = {keyword: [] for keyword in top_200_keywords}

# Getting unique periods in the last 6 months and sorting them
unique_periods = sorted(casts_KOLTOP20_180days_last_6_months['month_year'].unique())

# Calculating keyword frequencies by month (tokenizing each month only once)
from collections import Counter

for period in unique_periods:
    month_data = casts_KOLTOP20_180days_last_6_months[casts_KOLTOP20_180days_last_6_months['month_year'] == period]
    month_counts = Counter(' '.join(month_data['clean_text']).split())
    for keyword in top_200_keywords:
        keyword_frequencies[keyword].append(month_counts[keyword])

# Converting the dictionary to a DataFrame
keyword_frequencies_df = pd.DataFrame(keyword_frequencies, index=unique_periods)

# Transposing the DataFrame to have keywords as rows and periods as columns
keyword_frequencies_df = keyword_frequencies_df.transpose()

# Printing the resulting DataFrame
print(keyword_frequencies_df)

# Saving the resulting DataFrame to a CSV file
keyword_frequencies_df.to_csv('keyword_frequencies_by_month.csv', index=True)

# Downloading the CSV file
files.download('keyword_frequencies_by_month.csv')

            2024-03-01  2024-04-01  2024-05-01  2024-06-01  2024-07-01  \
channel             68         102         220        9387        9509   
game                17          22          18       18565       18867   
play                36          23          18        9293        9417   
player               1           6          17        9242        9407   
degen              252         465         247         177         127   
...                ...         ...         ...         ...         ...   
24h                  0          15          12          18           9   
congrats            12          21          19           9           4   
experience           5           7          12          13          24   
ai                  22           7          15          19           6   
look                 5           9          15          16          18   

            2024-08-01  
channel             87  
game                87  
play                41  
player     


In [None]:
# Defining the list of words/phrases to count (case-insensitive)
words_to_count = ["Consensys", "MetaMask", "Infura", "Linea", "Ethereum", "eth"]

# Initializing a dictionary to store word frequencies by month
word_frequencies = {word.lower(): [] for word in words_to_count}
word_frequencies["meta mask"] = []  # Add an entry for "meta mask"

# Calculating word frequencies by month
for word in words_to_count:
    word_lower = word.lower()
    for period in unique_periods:
        month_data = casts_KOLTOP20_180days_last_6_months[casts_KOLTOP20_180days_last_6_months['month_year'] == period]
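        month_text = ' '.join(month_data['text']).lower()  # Convert the text to lowercase
        # Note: str.count matches substrings, so the "eth" row also includes every "ethereum" (and words such as "together")
        word_frequencies[word_lower].append(month_text.count(word_lower))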

        # Count "meta mask"
        if word_lower == "metamask":  # Only check for "meta mask" when counting "MetaMask"
            meta_mask_count = month_text.count("meta mask")
            word_frequencies["meta mask"].append(meta_mask_count)

# Converting the dictionary to a DataFrame
word_frequencies_df = pd.DataFrame(word_frequencies, index=unique_periods)

# Transposing the DataFrame to have words as rows and periods as columns
word_frequencies_df = word_frequencies_df.transpose()

# Printing the resulting DataFrame
print(word_frequencies_df)

# Saving the resulting DataFrame to a CSV file
word_frequencies_df.to_csv('word_frequencies_by_month.csv', index=True)

# Downloading the CSV file
files.download('word_frequencies_by_month.csv')

           2024-03-01  2024-04-01  2024-05-01  2024-06-01  2024-07-01  \
consensys           0           0           1           0           0   
metamask            0           0           0           2           5   
infura              0           0           0           0           0   
linea               1           0           2           1           0   
ethereum           21          10           8          20          30   
eth                85          95          96         157         187   
meta mask           0           0           0           0           0   

           2024-08-01  
consensys           0  
metamask            0  
infura              0  
linea               1  
ethereum            1  
eth                54  
meta mask           0  
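
Because `str.count` matches substrings, the `eth` row above includes every occurrence of `ethereum` as well as unrelated words that contain those letters. A minimal sketch of a whole-word alternative is shown below, using a word-boundary regular expression. `count_term` and the sample sentence are hypothetical and only illustrate the difference.

```python
import re

def count_term(text, term):
    """Count case-insensitive whole-word occurrences of a word or phrase."""
    pattern = r'\b' + re.escape(term) + r'\b'
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

# Hypothetical sample text
sample = "ETH is up. Ethereum devs met together; eth gas fell. MetaMask and meta mask."

eth_whole = count_term(sample, "eth")          # only the standalone "ETH" and "eth"
eth_substring = sample.lower().count("eth")    # also counts "Ethereum" and "together"
metamask_whole = count_term(sample, "MetaMask")
meta_mask_whole = count_term(sample, "meta mask")
print(eth_whole, eth_substring, metamask_whole, meta_mask_whole)
```

The same function handles the two-word variant `meta mask`, so the special case in the loop above would no longer need separate handling.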



# **STEP 6: Topic Evaluation With GEN AI (Pretrained Transformers)**

In [None]:
# Loading the T5 model and tokenizer
t5_model_name = 't5-small'
t5_model = T5ForConditionalGeneration.from_pretrained(t5_model_name)
t5_tokenizer = T5Tokenizer.from_pretrained(t5_model_name)

# Loading the BART model and tokenizer
bart_model_name = 'facebook/bart-large-cnn'
bart_model = BartForConditionalGeneration.from_pretrained(bart_model_name)
bart_tokenizer = BartTokenizer.from_pretrained(bart_model_name)

# Loading the Pegasus model and tokenizer
pegasus_model_name = 'google/pegasus-xsum'
pegasus_model = PegasusForConditionalGeneration.from_pretrained(pegasus_model_name)
pegasus_tokenizer = PegasusTokenizer.from_pretrained(pegasus_model_name)

# Defining the function to summarize text using T5
def t5_summarize(text_list, max_length=50, min_length=10):
    text = ' '.join(text_list)  # Join all texts into a single string
    inputs = t5_tokenizer.encode("summarize in a full sentence: " + text, return_tensors="pt", max_length=512, truncation=True)
    outputs = t5_model.generate(inputs, max_length=max_length, min_length=min_length, length_penalty=2.0, num_beams=4, early_stopping=True)
    summary = t5_tokenizer.decode(outputs[0], skip_special_tokens=True)
    return summary

# Defining the function to summarize text using BART
def bart_summarize(text_list, max_length=50, min_length=10):
    text = ' '.join(text_list)  # Join all texts into a single string
    inputs = bart_tokenizer.encode("summarize in a full sentence: " + text, return_tensors="pt", max_length=512, truncation=True)
    outputs = bart_model.generate(inputs, max_length=max_length, min_length=min_length, length_penalty=2.0, num_beams=4, early_stopping=True)
    summary = bart_tokenizer.decode(outputs[0], skip_special_tokens=True)
    return summary

# Defining the function to summarize text using Pegasus
def pegasus_summarize(text_list, max_length=50, min_length=10):
    text = ' '.join(text_list)  # Join all texts into a single string
    inputs = pegasus_tokenizer.encode("summarize in a full sentence: " + text, return_tensors="pt", max_length=512, truncation=True)
    outputs = pegasus_model.generate(inputs, max_length=max_length, min_length=min_length, length_penalty=2.0, num_beams=4, early_stopping=True)
    summary = pegasus_tokenizer.decode(outputs[0], skip_special_tokens=True)
    return summary

# For casts_KOLTOP20_30days
# Grouping by dominant_topic and collecting the texts of each topic into a list
grouped_casts_KOLTOP20_30days = casts_KOLTOP20_30days.groupby('dominant_topic')['text'].apply(list).reset_index()
# Applying summarization to each group for suggested topic and summary using three models
grouped_casts_KOLTOP20_30days['t5_suggested_topic'] = grouped_casts_KOLTOP20_30days['text'].apply(lambda x: t5_summarize(x, max_length=50, min_length=10))
grouped_casts_KOLTOP20_30days['t5_summary'] = grouped_casts_KOLTOP20_30days['text'].apply(lambda x: t5_summarize(x, max_length=100, min_length=50))
grouped_casts_KOLTOP20_30days['bart_suggested_topic'] = grouped_casts_KOLTOP20_30days['text'].apply(lambda x: bart_summarize(x, max_length=50, min_length=10))
grouped_casts_KOLTOP20_30days['bart_summary'] = grouped_casts_KOLTOP20_30days['text'].apply(lambda x: bart_summarize(x, max_length=100, min_length=50))
grouped_casts_KOLTOP20_30days['pegasus_suggested_topic'] = grouped_casts_KOLTOP20_30days['text'].apply(lambda x: pegasus_summarize(x, max_length=50, min_length=10))
grouped_casts_KOLTOP20_30days['pegasus_summary'] = grouped_casts_KOLTOP20_30days['text'].apply(lambda x: pegasus_summarize(x, max_length=100, min_length=50))

# For recasts_KOLTOP20_30days
# Grouping by dominant_topic and collecting the texts of each topic into a list
grouped_recasts_KOLTOP20_30days = recasts_KOLTOP20_30days.groupby('dominant_topic')['text'].apply(list).reset_index()
# Applying summarization to each group for suggested topic and summary using three models
grouped_recasts_KOLTOP20_30days['t5_suggested_topic'] = grouped_recasts_KOLTOP20_30days['text'].apply(lambda x: t5_summarize(x, max_length=50, min_length=10))
grouped_recasts_KOLTOP20_30days['t5_summary'] = grouped_recasts_KOLTOP20_30days['text'].apply(lambda x: t5_summarize(x, max_length=100, min_length=50))
grouped_recasts_KOLTOP20_30days['bart_suggested_topic'] = grouped_recasts_KOLTOP20_30days['text'].apply(lambda x: bart_summarize(x, max_length=50, min_length=10))
grouped_recasts_KOLTOP20_30days['bart_summary'] = grouped_recasts_KOLTOP20_30days['text'].apply(lambda x: bart_summarize(x, max_length=100, min_length=50))
grouped_recasts_KOLTOP20_30days['pegasus_suggested_topic'] = grouped_recasts_KOLTOP20_30days['text'].apply(lambda x: pegasus_summarize(x, max_length=50, min_length=10))
grouped_recasts_KOLTOP20_30days['pegasus_summary'] = grouped_recasts_KOLTOP20_30days['text'].apply(lambda x: pegasus_summarize(x, max_length=100, min_length=50))

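# Combining both DataFrames into one for CSV export, labelling each row with its source
combined_df = pd.concat([
    grouped_casts_KOLTOP20_30days.assign(source='casts'),
    grouped_recasts_KOLTOP20_30days.assign(source='recasts'),
], ignore_index=True)

# Adding titles to the tables (stored in .attrs, since setting a new attribute on a DataFrame raises a pandas UserWarning)
grouped_casts_KOLTOP20_30days.attrs['title'] = "Casts"
grouped_recasts_KOLTOP20_30days.attrs['title'] = "Recasts"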

# Displaying the resulting DataFrames
display(grouped_casts_KOLTOP20_30days.head(), grouped_recasts_KOLTOP20_30days.head())

# Saving to CSV and provide download link
combined_df.to_csv('combined_topic_summaries.csv', index=False)

# Downloading the CSV file
from google.colab import files
files.download('combined_topic_summaries.csv')


index,dominant_topic,text,t5_suggested_topic,t5_summary,bart_suggested_topic,bart_summary,pegasus_suggested_topic,pegasus_summary
0,0,"[https://yo-dudes.vercel.app/api, #18,000 NFT ...",if the more powerful accounts don't engage wit...,if the more powerful accounts don't engage wit...,summarize in a full sentence: https://yo-dudes...,summarize in a full sentence: https://yo-dudes...,If you want to learn more about how to build a...,If you want to learn more about how to build a...
1,1,[Have you checked the $hunt-tip ranks today?\n...,"if you're a rational person, you have to be wi...","if you're a rational person, you have to be wi...",The $hunt-tip leaderboard shows who's winning ...,The $hunt-tip leaderboard shows who's winning ...,"In case you missed it, here's a round-up of so...","In case you missed it, here's a round-up of so..."
2,2,"[Nice, ! You've created a game!\nI will let yo...",I will let you know once you've matched with a...,I will let you know once you've matched with a...,"Summarize in a full sentence: Nice,! You've cr...","Summarize in a full sentence: Nice,! You've cr...","Nice,! Nice,! Nice,! Nice,!",If you're a Moxie Maxi please amplify https://...
3,3,[I just bid for 's Fan Tokens powered by cc ...,I just bought's Fan Tokens powered by cc based...,I just bought's Fan Tokens powered by cc based...,Tim Walz came across like an inspirational HS ...,Tim Walz came across like an inspirational HS ...,I just bid for's Fan Tokens powered by cc base...,I just bid for's Fan Tokens powered by cc base...
4,4,"[Newton's Third Law of Motion, which states th...",you can check out who’s leading the top-list. ...,you can check out who’s leading the top-list. ...,I do yoga at least once a week. In ideal perio...,I do yoga at least once a week. In ideal perio...,In our series of letters from African journali...,In our series of letters from African journali...


index,dominant_topic,text,t5_suggested_topic,t5_summary,bart_suggested_topic,bart_summary,pegasus_suggested_topic,pegasus_summary
0,0,"[, You've matched with !\nGame results coming ...","has won with Joker,12,8,3,1 vs.'s Joker,10,7,4...","has won with Joker,12,8,3,1 vs.'s Joker,10,7,4...",This is a very big step. We will make it self ...,This is a very big step. We will make it self ...,"You've matched with a Joker, and now you can w...","You've matched with a Joker, and now you've wo..."
1,1,"[overtipping is not a /bug, its a feature, gre...","ya başka bir cüzdana giriştikten sonra ""Fan To...",ya başka bir cüzdana giriştikten sonra “Fan To...,Summarize in a full sentence: overtipping is n...,Summarize in a full sentence: overtipping is n...,"777 $hunt thanks boss, good day to you! 777 $h...","777 $hunt thanks boss, good day to you! 100 $h..."
2,2,[Mostly letting others play the auctions. Bidd...,"'s 10,7,5 and's 7,7,7 Play again or find a mat...","'s 10,7,5 and's 7,7,7 Play again or find a mat...",300 $PERL will be credited to your account in ...,Bidding here and there mostly for testing the ...,I'm not sure what you're talking about.,"I'm not sure what you're talking about, but I'..."
3,3,"[Lolol, FAFO, anyone can participate!, lfggggg...",gm ily gotta go watch deadpool first more meme...,gm ily gotta go watch deadpool first more meme...,summarize in a full sentence: Lolol FAFO anyon...,summarize in a full sentence: Lolol FAFO anyon...,We've been having a lot of fun answering your ...,We've been having a lot of fun answering your ...
4,4,[anything for us who viewed the livestream?? 🤣...,mr.japan sees this cast my man getting driven ...,mr.japan sees this cast my man getting driven ...,Summarize in a full sentence: anything for us ...,Summarize in a full sentence: anything for us ...,What do you want us to know about the LFG Lowe...,What do you want us to know about your experie...
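
Several BART outputs above begin with the instruction text itself ("summarize in a full sentence: ..."). T5 was trained with text task prefixes such as `summarize: `, whereas the `facebook/bart-large-cnn` and `google/pegasus-xsum` checkpoints expect the raw document, so a prefix passed to them is treated as part of the text to summarize. A minimal sketch of building the input per model family is given below. `TASK_PREFIX` and `build_summarizer_input` are hypothetical names, and the sketch does not call the models.

```python
# Task prefix per model family: only T5 uses a text prefix for summarization
TASK_PREFIX = {
    't5': 'summarize: ',
    'bart': '',
    'pegasus': '',
}

def build_summarizer_input(text_list, model_family):
    """Join the texts of one topic and prepend the prefix the model family expects."""
    return TASK_PREFIX[model_family] + ' '.join(text_list)

# Hypothetical casts for one topic
texts = ["gm farcaster", "frames are live"]

t5_input = build_summarizer_input(texts, 't5')
bart_input = build_summarizer_input(texts, 'bart')
pegasus_input = build_summarizer_input(texts, 'pegasus')
print(repr(t5_input), repr(bart_input), repr(pegasus_input))
```

The resulting string would replace the `"summarize in a full sentence: " + text` argument passed to each tokenizer; the generation settings can stay as they are.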



In [None]:
# For casts_KOLTOP20_180days
# Grouping by dominant_topic and collecting the texts of each topic into a list
grouped_casts_KOLTOP20_180days = casts_KOLTOP20_180days.groupby('dominant_topic')['text'].apply(list).reset_index()

# Applying summarization to each group for suggested topic and summary using three models
grouped_casts_KOLTOP20_180days['t5_suggested_topic'] = grouped_casts_KOLTOP20_180days['text'].apply(lambda x: t5_summarize(x, max_length=10, min_length=5))
grouped_casts_KOLTOP20_180days['t5_summary'] = grouped_casts_KOLTOP20_180days['text'].apply(lambda x: t5_summarize(x, max_length=100, min_length=50))
grouped_casts_KOLTOP20_180days['bart_suggested_topic'] = grouped_casts_KOLTOP20_180days['text'].apply(lambda x: bart_summarize(x, max_length=10, min_length=5))
grouped_casts_KOLTOP20_180days['bart_summary'] = grouped_casts_KOLTOP20_180days['text'].apply(lambda x: bart_summarize(x, max_length=100, min_length=50))
grouped_casts_KOLTOP20_180days['pegasus_suggested_topic'] = grouped_casts_KOLTOP20_180days['text'].apply(lambda x: pegasus_summarize(x, max_length=10, min_length=5))
grouped_casts_KOLTOP20_180days['pegasus_summary'] = grouped_casts_KOLTOP20_180days['text'].apply(lambda x: pegasus_summarize(x, max_length=100, min_length=50))

# Displaying the resulting DataFrame
display(grouped_casts_KOLTOP20_180days.head())

# Saving to CSV and provide download link
grouped_casts_KOLTOP20_180days.to_csv('grouped_casts_KOLTOP20_180days_topic_summaries.csv', index=False)

# Downloading the CSV file
files.download('grouped_casts_KOLTOP20_180days_topic_summaries.csv')

index,dominant_topic,text,t5_suggested_topic,t5_summary,bart_suggested_topic,bart_summary,pegasus_suggested_topic,pegasus_summary
0,0,"[https://yo-dudes.vercel.app/api, According to...",the circulating supply of $DEGEN is,the circulating supply of $DEGEN is only 12.4 ...,summarize in a full sentence,summarize in a full sentence: https://yo-dudes...,The following is a selection of some of,The following is a selection of some of the be...
1,1,[Social media user reaction to this story in P...,nfa lmfa,"""the avoidable war"" is the persistent factor o...",Summarize in a full sentence,summarize in a full sentence: Social media use...,Some of the quirkier snippets from the,A look back at some of the quirkier snippets f...
2,2,[Day 6 of $hunt-tip Season 2!\n\nIt's still ea...,the AuthKit library is open source,the AuthKit library is open source -- anyone i...,The AuthKit library is open source,The AuthKit library is open source -- anyone i...,Day 6 of $hunt-tip Season,Day 6 of $hunt-tip Season 2 - what's the secre...
3,3,[There's sushi restaurants and then there's......,"mfers be like ""who sold","mfers be like ""who sold"" me: ""crypto doesn't h...",Farcaster is a weekly newsletter,Farcaster is a weekly newsletter from CNN Tech...,In Case You Missed It: a daily,In Case You Missed It: a round-up of interesti...
4,4,"[Nice, ! You've created a game!\nI will let yo...",I will let you know once you've,I will let you know once you've matched with a...,Summarize in a full sentence,"Summarize in a full sentence: Nice,! You've cr...",Check out the /card channel to find,Check out the /card channel to find games to p...
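
Each summarize function joins all casts of a topic and then truncates the input to 512 tokens, so for large topics only the first few casts influence the summary. One possible way around this, sketched here under stated assumptions, is to split the casts into batches and summarize each batch separately. `chunk_texts` is a hypothetical helper, and the word budget is only a rough proxy for the tokenizer's token limit.

```python
def chunk_texts(text_list, max_words=300):
    """Group consecutive texts into chunks of at most max_words words (a single longer text forms its own chunk)."""
    chunks, current, current_len = [], [], 0
    for text in text_list:
        n_words = len(text.split())
        # Start a new chunk when adding this text would exceed the budget
        if current and current_len + n_words > max_words:
            chunks.append(' '.join(current))
            current, current_len = [], 0
        current.append(text)
        current_len += n_words
    if current:
        chunks.append(' '.join(current))
    return chunks

# Toy casts (hypothetical) with a small budget to make the grouping visible
casts = ["a b c", "d e", "f g h i", "j"]
chunks = chunk_texts(casts, max_words=5)
print(chunks)
```

Each chunk could then be passed as a one-element list to `t5_summarize`, `bart_summarize` or `pegasus_summarize`, and the per-chunk summaries summarized once more to obtain a single description per topic.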

