# CryptoProphet
## Notebook's Goal
> Filter tweets using zero-shot classification, keeping only those that carry information about Crypto or Bitcoin.

In [5]:
# imports modules
from src.paths import LOCAL_RAW_DATA_PATH, LOCAL_PROCESSED_DATA_PATH, LOCAL_MODELS_PATH
from tqdm.notebook import tqdm_notebook
from transformers import pipeline
import pandas as pd
import xgboost
import pickle

# registers progress_apply on pandas objects
tqdm_notebook.pandas()

# loads data
df_path = LOCAL_PROCESSED_DATA_PATH / 'pretrain_dataset_20211013.pkl'
df = pd.read_pickle(df_path)


In [6]:
# loads model
model = pipeline('zero-shot-classification')

In [7]:
# tests model
text = 'I wanna take a cab at 8 pm'
labels = ['Taxi', 'Appointment', 'Trip', 'Game']
model(text, labels, multi_label=True)

{'sequence': 'I wanna take a cab at 8 pm',
 'labels': ['Taxi', 'Trip', 'Game', 'Appointment'],
 'scores': [0.8777869939804077,
  0.6453558802604675,
  0.05611748248338699,
  0.04601442068815231]}
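
The result lists labels by descending score, not in the order they were passed in, and with `multi_label=True` each label is scored independently, so the scores need not sum to 1. A minimal sketch of turning one result into a label-to-score lookup follows; `scores_by_label` is a hypothetical helper name, not part of `transformers`.

```python
def scores_by_label(result):
    """Map each candidate label to its score for one zero-shot result dict."""
    return dict(zip(result['labels'], result['scores']))


# Output of the test cell above, copied verbatim.
result = {
    'sequence': 'I wanna take a cab at 8 pm',
    'labels': ['Taxi', 'Trip', 'Game', 'Appointment'],
    'scores': [0.8777869939804077, 0.6453558802604675,
               0.05611748248338699, 0.04601442068815231],
}

by_label = scores_by_label(result)
print(by_label['Appointment'])      # score looked up by name, not by position
print(sum(result['scores']) > 1.0)  # independent scores can add up to more than 1
```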

In [9]:
# predicts label confidence for all available text
# BE CAREFUL running these lines...
# (it took us over 18h to run this locally!!!)
labels = ['Crypto', 'Bitcoin']
df['zsc_classes'] = df.full_text.progress_apply(lambda x: model(x, labels, multi_label=True))

# saves the raw classifier output as a safety backup
df['zsc_classes'].to_pickle(LOCAL_PROCESSED_DATA_PATH / 'zsc_raw_classes_BTC.pkl')

# builds the filter mask: keeps a tweet when its best label score is > 0.65
mask = df.zsc_classes.progress_apply(lambda x: max(x['scores']) > 0.65)

# shows a sample of the tweets that fail the threshold
df[~mask].sample(15, random_state=2)

  0%|          | 0/92071 [00:00<?, ?it/s]

| | created_at | created_at_trunc_h | id_str | full_text | retweet_count | favorite_count | user_screen_name | user_feat | BTC | DOGE | ... | 759 | 760 | 761 | 762 | 763 | 764 | 765 | 766 | 767 | zsc_classes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 29732 | 2020-12-11 14:08:59 | 2020-12-11 14:00:00 | 1337398982247641088 | @ErmiyaK Agreed. I don’t use it either. But in... | 0.0 | 1.0 | coinbureau | 32 | 0 | 0 | ... | 0.049659 | 0.020229 | 0.04115 | 0.038924 | 0.030031 | 0.000111 | -0.065163 | 0.100103 | 0.07028 | {'sequence': '@ErmiyaK Agreed. I don’t use it ... |
| 81231 | 2021-04-11 20:12:16 | 2021-04-11 20:00:00 | 1381339335677652992 | @xrpstandard2 🤔 | 0.0 | 0.0 | davidgokhshtein | 34 | 0 | 0 | ... | 0.097082 | -0.196413 | -0.202156 | 0.003059 | -0.284737 | 0.124238 | 0.081949 | 0.101685 | -0.060527 | {'sequence': '@xrpstandard2 🤔', 'labels': ['Cr... |
| 91090 | 2019-10-29 18:02:13 | 2019-10-29 18:00:00 | 1189241039074316288 | Story of @youngdumbcrypto's life....\n\nDo you... | 0.0 | 1.0 | Coinbound_io | 6 | 0 | 0 | ... | -0.033485 | -0.078214 | -0.054801 | -0.079372 | 0.272323 | 0.017875 | 0.035701 | -0.184386 | 0.113615 | {'sequence': 'Story of @youngdumbcrypto's life... |
| 60371 | 2021-03-16 17:38:50 | 2021-03-16 17:00:00 | 1371878638367301632 | @DariusWilhite @investvoyager They are working... | 0.0 | 0.0 | CryptoWendyO | 10 | 0 | 0 | ... | 0.105006 | -0.00615 | -0.02724 | -0.06503 | -0.007967 | -0.119518 | 0.024294 | -0.067923 | -0.103448 | {'sequence': '@DariusWilhite @investvoyager Th... |
| 80230 | 2020-09-05 03:20:18 | 2020-09-05 03:00:00 | 1302084110530351104 | @RealCryptoV @ForSupplychain oooohhhhh a fat j... | 0.0 | 0.0 | Bitboy_Crypto | 4 | 0 | 0 | ... | 0.018724 | 0.02234 | -0.056541 | 0.043452 | 0.415814 | 0.056469 | 0.042964 | -0.079079 | -0.158736 | {'sequence': '@RealCryptoV @ForSupplychain ooo... |
| 14512 | 2021-01-30 09:46:56 | 2021-01-30 09:00:00 | 1355452426875064320 | https://t.co/pyRCFE97Xp | 4268.0 | 118583.0 | elonmusk | 35 | 0 | 0 | ... | 0.079822 | 0.078879 | 0.030853 | 0.010629 | -0.101329 | -0.055084 | 0.003691 | 0.189589 | -0.026141 | {'sequence': 'https://t.co/pyRCFE97Xp', 'label... |
| 22100 | 2021-04-10 20:22:20 | 2021-04-10 20:00:00 | 1380979481309945856 | @engineers_feed Due to lower gravity, you can ... | 1926.0 | 49462.0 | elonmusk | 35 | 0 | 0 | ... | -0.01656 | -0.002077 | -0.062541 | 0.120228 | -0.321045 | 0.017718 | -0.104404 | -0.055482 | 0.087442 | {'sequence': '@engineers_feed Due to lower gra... |
| 71525 | 2021-04-27 19:26:46 | 2021-04-27 19:00:00 | 1387126091198894080 | @NinetyEightNHL It’s okay, I paid cash | 0.0 | 1.0 | PeterMcCormack | 19 | 0 | 0 | ... | -0.091418 | 0.072264 | 0.211494 | -0.071503 | 0.14196 | 0.178561 | 0.044941 | -0.002074 | -0.022271 | {'sequence': '@NinetyEightNHL It’s okay, I pai... |
| 54636 | 2021-04-12 19:28:03 | 2021-04-12 19:00:00 | 1381690595354230784 | RT @stellabelle: @cryptopom1 @xanderatallah Ye... | 1.0 | 0.0 | KennethBosak | 16 | 0 | 0 | ... | -0.035876 | 0.139444 | -0.083664 | -0.003169 | -0.063957 | 0.021437 | 0.022833 | -0.074307 | -0.023318 | {'sequence': 'RT @stellabelle: @cryptopom1 @xa... |
| 22380 | 2021-04-08 14:25:28 | 2021-04-08 14:00:00 | 1380164897976115200 | @GoingParabolic I'm just going to say good mor... | 0.0 | 9.0 | CryptoWendyO | 10 | 0 | 0 | ... | -0.027255 | 0.146866 | 0.090171 | 0.018699 | 0.01489 | 0.111566 | -0.075572 | -0.098585 | 0.111741 | {'sequence': '@GoingParabolic I'm just going t... |
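
Since the classification step ran for over 18 hours, an interrupted run is costly. The sketch below is one possible way to process the texts in chunks and pickle the progress after each chunk, so a rerun resumes where the previous one stopped. `classify_in_chunks`, its `chunk_size` default, and the checkpoint file are hypothetical and not part of the original notebook.

```python
import pickle
from pathlib import Path


def classify_in_chunks(texts, classify, checkpoint_path, chunk_size=1000):
    """Classify `texts` chunk by chunk, pickling progress after every chunk.

    `classify` is any callable that takes a list of strings and returns one
    result per string, in order. If `checkpoint_path` already exists, the
    results stored there are reused and only the remaining texts are processed.
    """
    checkpoint_path = Path(checkpoint_path)
    results = []
    if checkpoint_path.exists():
        with checkpoint_path.open('rb') as fh:
            results = pickle.load(fh)

    for start in range(len(results), len(texts), chunk_size):
        chunk = list(texts[start:start + chunk_size])
        out = classify(chunk)
        # A classifier may return a bare dict for a one-item chunk;
        # wrap it so that extend() adds one result rather than the dict keys.
        if isinstance(out, dict):
            out = [out]
        results.extend(out)
        with checkpoint_path.open('wb') as fh:
            pickle.dump(results, fh)
    return results
```

With the notebook's objects, `texts` could be `df.full_text.tolist()` and `classify` could be `lambda chunk: model(chunk, labels, multi_label=True)`, assuming the pipeline is given a list of strings. The checkpoint must be deleted whenever the input texts or labels change, because the helper only checks how many results were already stored.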


In [13]:
# prints tweets that fail the threshold (max score <= 0.65); these are filtered out of the analysis
for x in df[~mask].sample(15, random_state=1).full_text.to_list():
    print(20*'=')
    print(x)

@elonmusk @PeterSchiff Perfectly said.
@paulg The boom seats not being fully reclining beds makes it less attractive imo. A 12h biz-class flight where you can properly sleep "feels like" 4h, a 7h flight where you can't feels like 7h.

Hope they can get costs down and compete with economy/premium-econ soon!
@ImNotTheWolf I think so I have to check and if I did Ill make a tiktok
12/ Whereas, Bitcoin Cash (33% NH-able) and BitcoinSV (40% NH-able) are much much easier to 51% attack. If you can rent the hashrate, then there's no upfront costs and only the incremental costs are needed. That's why it's important for a POW coin to be mining algorithm dominant. https://t.co/8azgCCInGM
@AdamPaigge :)
@jack_zampolin I’m a fan of $Atom https://t.co/f4CfNfTQZK
RT @arca: DeFi’s Breakout Due to Real Value Accretion and Governance
https://t.co/2YusWr3KZZ
Including a quote from yours truly. #NFTs

https://t.co/AGfCVR5S2i
@WonTronSoup @certikorg Both DM’ed me. 

Now go to bed. Make sure nanny covers you

In [12]:
# prints dataset shape
df[mask].shape

(30069, 796)
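
With the 0.65 cutoff, 30069 of the 92071 tweets survive, roughly a third. To see how sensitive that share is to the cutoff, the saved raw scores can be re-filtered without rerunning the classifier. The sketch below uses hypothetical helpers (`passes`, `retention_by_threshold`) and made-up scores, purely for illustration.

```python
def passes(result, threshold=0.65):
    """True when the best label score is strictly above the threshold."""
    return max(result['scores']) > threshold


def retention_by_threshold(results, thresholds):
    """Return {threshold: fraction of results kept} for each threshold."""
    total = len(results)
    if total == 0:
        return {t: 0.0 for t in thresholds}
    return {t: sum(passes(r, t) for r in results) / total for t in thresholds}


# Made-up results for illustration only; not taken from the dataset.
toy_results = [
    {'labels': ['Bitcoin', 'Crypto'], 'scores': [0.97, 0.91]},
    {'labels': ['Crypto', 'Bitcoin'], 'scores': [0.70, 0.12]},
    {'labels': ['Crypto', 'Bitcoin'], 'scores': [0.40, 0.35]},
    {'labels': ['Crypto', 'Bitcoin'], 'scores': [0.05, 0.02]},
]

print(retention_by_threshold(toy_results, [0.5, 0.65, 0.8]))
# → {0.5: 0.5, 0.65: 0.5, 0.8: 0.25}
```

On the real data, `results` could be `df.zsc_classes.tolist()` or the contents of the `zsc_raw_classes_BTC.pkl` backup.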

In [26]:
# exports data
df[mask].to_pickle(LOCAL_PROCESSED_DATA_PATH / 'pretrain_dataset_20211016_zsc.pkl')

## Conclusion
> We could filter tweets with the labels Crypto and Bitcoin with reasonable accuracy (apparently :D).
>
> The drawback is that the number of tweets available for training dropped drastically (from 92,071 to 30,069).
>
> This reduction might affect the model's performance.