#### The following notebook includes:
1. Initial analysis of the data
2. Sampling for computation ease
3. Regular Expression Transformation

## Libraries

In [1]:
import re
import pandas as pd
from tqdm import tqdm

File names:

In [2]:
directory = '~/PycharmProjects/tfm_hugopobil'
tweets_fn = f'{directory}/data/bitcoin_tweets.csv'
bitcoin_price_fn = f'{directory}/data/BTC-USD.csv'

## Visualization

We load the dataset of tweets downloaded from the internet. The tweets cover the period from 10 February 2021 to 2 March 2022.

To keep the computation manageable, we reduce the dataset to roughly 1% of the total data. This means we will analyse and predict using about 23,470 tweets.

In [3]:
# Caution: this loads the full file (more than 2 million tweets), so run it only when needed
tweets = pd.read_csv(tweets_fn, low_memory=False)
print('Shape :', tweets.shape)
print('Tweets DataFrame Initial Date :', tweets.date[0])
print('Tweets DataFrame Final Date :', tweets.date.iloc[-1])
tweets.head()

Shape : (2347470, 13)
Tweets DataFrame Initial Date : 2021-02-10 23:59:04
Tweets DataFrame Final Date : 2022-03-02 16:38:11


Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,is_retweet
0,DeSota Wilson,"Atlanta, GA","Biz Consultant, real estate, fintech, startups...",2009-04-26 20:05:09,8534.0,7605,4838,False,2021-02-10 23:59:04,Blue Ridge Bank shares halted by NYSE after #b...,['bitcoin'],Twitter Web App,False
1,CryptoND,,😎 BITCOINLIVE is a Dutch platform aimed at inf...,2019-10-17 20:12:10,6769.0,1532,25483,False,2021-02-10 23:58:48,"😎 Today, that's this #Thursday, we will do a ""...","['Thursday', 'Btc', 'wallet', 'security']",Twitter for Android,False
2,Tdlmatias,"London, England","IM Academy : The best #forex, #SelfEducation, ...",2014-11-10 10:50:37,128.0,332,924,False,2021-02-10 23:54:48,"Guys evening, I have read this article about B...",,Twitter Web App,False
3,Crypto is the future,,I will post a lot of buying signals for BTC tr...,2019-09-28 16:48:12,625.0,129,14,False,2021-02-10 23:54:33,$BTC A big chance in a billion! Price: \487264...,"['Bitcoin', 'FX', 'BTC', 'crypto']",dlvr.it,False
4,Alex Kirchmaier 🇦🇹🇸🇪 #FactsSuperspreader,Europa,Co-founder @RENJERJerky | Forbes 30Under30 | I...,2016-02-03 13:15:55,1249.0,1472,10482,False,2021-02-10 23:54:06,This network is secured by 9 508 nodes as of t...,['BTC'],Twitter Web App,False
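
If loading the full file is too heavy, one option is to read it in chunks and keep only a fraction of each chunk. The snippet below is a minimal sketch of that idea, not the approach used in this notebook: it runs on a small in-memory CSV with made-up rows, and the chunk size and sampling fraction are illustrative assumptions.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for bitcoin_tweets.csv (hypothetical rows).
csv_text = "date,text\n" + "\n".join(
    f"2021-02-{10 + i % 3:02d} 12:00:00,tweet {i}" for i in range(30)
)

# Read the file chunk by chunk and keep a fraction of each chunk,
# so the full dataset never has to sit in memory at once.
pieces = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=10):
    pieces.append(chunk.sample(frac=0.2, random_state=0))

small = pd.concat(pieces, ignore_index=True)
print(small.shape)  # (6, 2): 2 rows kept from each of the 3 chunks
```

Note that this samples uniformly within each chunk rather than per day, so it is not a replacement for the stratified sampling done later.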


In [4]:
tweets.columns

Index(['user_name', 'user_location', 'user_description', 'user_created',
       'user_followers', 'user_friends', 'user_favourites', 'user_verified',
       'date', 'text', 'hashtags', 'source', 'is_retweet'],
      dtype='object')

In [7]:
tweets.sort_values(by = 'date').tail(20)

Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,is_retweet
2339103,Hodlers Journey 🍁 ⚡️,,#bitcoin,2021-04-08 18:39:21,236.0,1126,715,False,2022-03-02 23:59:49,@jkenney Now eliminate the debt and add #bitco...,['bitcoin'],Twitter for iPhone,False
2339102,The Crypto Curator #BTC100K 🔥 🚀 🏳️‍🌈 🟩,Everywhere ;),#Bitcoin Evangelist and Aficionado. Military #...,2009-01-17 13:25:27,27860.0,10162,139886,False,2022-03-02 23:59:51,@TartishaHill Congrats! Have you heard about $...,['Bitcoin'],Twitter Web App,False
2339101,Raquel morina,Australia,Graduated student🙄,2021-12-10 07:03:14,94.0,137,102,False,2022-03-02 23:59:51,"If anything, WOLVERINU is going to moon with t...","['crypto', 'bitcoin', 'cryptocurrency', 'btc',...",Twitter Web App,False
2339100,Blessed Mom of 3 kiddos~I 💗 them so much~🕊️🥀,"Albuquerque, NM","💜Mom of Doom, G-Ray & CeeJ💜#TheBeKindImpact 💝 ...",2011-06-07 14:40:15,336.0,878,11043,False,2022-03-02 23:59:52,@TheMoonCarl #Solana #MATIC maybe #Ada but if ...,"['Solana', 'MATIC', 'Ada', 'Bitcoin']",Twitter for Android,False
2339099,Gery Rodriguez,Across Space,#Bitcoin,2011-12-23 18:49:39,214.0,234,4698,False,2022-03-02 23:59:53,Whatever the mainstream media is narrating and...,['Bitcoin'],Twitter Web App,False
2339098,Ryan de Mateo,,,2019-03-06 11:27:39,3.0,155,27,False,2022-03-02 23:59:55,create twitter tasks and pay with #bitcoin #et...,"['bitcoin', 'ethereum', 'litecoin']",Twitter for Android,False
2339097,kei arisa mugo,,,2021-08-27 20:30:14,15.0,52,34,False,2022-03-02 23:59:56,"If anything, WOLVERINU is going to moon with t...","['crypto', 'bitcoin', 'cryptocurrency', 'btc',...",Twitter Web App,False
2341203,illegalmonkey77,NH,"I am no one and everyone. I see all, yet am bl...",2012-02-03 12:51:14,110.0,84,4383,False,2022-03-02 23:59:56,@MrDiamondhandz1 @saitanobi @InuSaitama @Shib_...,"['saitanobi', 'saitanobiRoadto1b', 'eth', 'btc...",Twitter Web App,False
2339096,Hem,,,2020-02-13 08:56:22,19.0,79,196,False,2022-03-02 23:59:59,"If anything, WOLVERINU is going to moon with t...","['crypto', 'bitcoin', 'cryptocurrency', 'btc',...",Twitter Web App,False
1811149,jadii nine is 9 :D,2011-01-20 02:00:55,200.0,229,44.0,False,2021-11-18 13:26:39,@airdropinspect Good and special project\n@anc...,"['Airdrop', 'Airdrops', 'Airdropinspector', 'B...",Twitter Web App,False,,


### Stratified sampling:

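The data shows that some rows were imported incorrectly: their fields are shifted, so the `date` column does not contain a valid date. After sorting by `date`, these malformed rows end up at the bottom, so we drop the last 12 rows to avoid further errors.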
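
As an alternative to a fixed row count, the malformed rows could be found by checking which `date` values fail to parse. The following is a minimal sketch on a toy DataFrame (the rows are made up, and the timestamp format is assumed from the output shown earlier), not the method used in the cell below.

```python
import pandas as pd

# Toy frame: the last row mimics a shifted record whose `date` field
# holds a hashtag list instead of a timestamp.
toy = pd.DataFrame({
    "date": ["2021-02-10 23:59:04", "2022-03-02 16:38:11", "['Airdrop', 'BTC']"],
    "text": ["a", "b", "Twitter Web App"],
})

# Values that do not match the timestamp format become NaT instead of raising.
parsed = pd.to_datetime(toy["date"], format="%Y-%m-%d %H:%M:%S", errors="coerce")

bad_rows = toy[parsed.isna()]
clean = toy[parsed.notna()].sort_values(by="date")
print(len(bad_rows), len(clean))
```

Filtering on the parsed dates keeps working even if the number of malformed rows changes when the dataset is refreshed.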

In [10]:
print(type(tweets.date))
tweets = tweets.sort_values(by = 'date')
tweets = tweets.iloc[:-12]  # drop the last 12 rows after sorting by date

<class 'pandas.core.series.Series'>


In [11]:
# Keep only the calendar day (the part before the first space) and parse it as a datetime
tweets['sample_date'] = pd.to_datetime(tweets['date'].str.split(' ', n=1).str[0])

In [12]:
tweets_sample = tweets.groupby('sample_date', group_keys=False).apply(lambda x: x.sample(200))

In [13]:
tweets_sample[tweets_sample.sample_date == '2021-02-05'].count()

user_name           200
user_location       114
user_description    186
user_created        200
user_followers      200
user_friends        200
user_favourites     200
user_verified       200
date                200
text                200
hashtags            148
source              199
is_retweet          200
sample_date         200
dtype: int64

In [14]:
tweets_sample.shape

(23200, 14)
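
The per-day sampling above draws a different sample on every run and raises an error if any day has fewer than 200 tweets. A minimal sketch of a reproducible variant follows, using a toy DataFrame; the seed value and the cap with `min` are assumptions added for illustration and are not part of the original cell.

```python
import pandas as pd

# Toy data: 5 tweets on one day and 2 on another (made-up rows).
toy = pd.DataFrame({
    "sample_date": pd.to_datetime(["2021-02-05"] * 5 + ["2021-02-06"] * 2),
    "text": [f"tweet {i}" for i in range(7)],
})

n_per_day = 3  # the notebook uses 200

# Sample each day separately; min() avoids an error on days with too few
# tweets, and random_state makes the draw repeatable.
sampled = pd.concat(
    [
        group.sample(n=min(n_per_day, len(group)), random_state=42)
        for _, group in toy.groupby("sample_date")
    ]
)
print(sampled["sample_date"].value_counts().sort_index())
```

With a fixed seed, rerunning the notebook yields the same sample, which makes downstream results easier to compare.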

## Data Cleaning

The first step when working with tweets is to clean the tweet text. Starting from the sampled tweets (roughly 1% of the full dataset), we apply basic regular-expression cleaning and save the result to CSV.

In [25]:
# Clean the tweet text
tweets_sample = tweets_sample.sort_values(by='date')

# Work on a copy of the sample (roughly 1% of all tweets)
dd = tweets_sample.copy()
dd = dd.reset_index()


def clean_tweet(text):
    """Lowercase a tweet and strip mentions, URLs and punctuation."""
    text = str(text)
    text = text.replace('#', '')              # drop the '#' sign but keep the hashtag word
    text = text.lower()
    text = re.sub(r'@[a-z0-9_]+', '', text)   # remove @mentions
    text = re.sub(r'http\S+', '', text)       # remove http(s) URLs
    text = re.sub(r'www\.\S+', '', text)      # remove bare www URLs
    text = re.sub(r'\[.*?\]', ' ', text)      # remove bracketed fragments
    text = re.sub(r'[^a-z0-9]', ' ', text)    # replace every other non-alphanumeric character with a space
    return text


dd['text'] = [clean_tweet(t) for t in tqdm(dd['text'], position=0, leave=True)]

dd.to_csv(f'{directory}/data/sampled_data/tweets_clean_v2.csv', header=True, encoding='utf-8', index=False)

100%|██████████| 23200/23200 [00:01<00:00, 16561.00it/s]


In [27]:
# Some examples of processed tweets
dd.text.iloc[0:10]

0     perl 0 06  i have insisted that since 0 02 it...
1      are we talking about bitcoin  sure  17 usd d...
2      prices update in  usdt  1 hour     btc   376...
3    dominus and johnewbanks i messed up in my firs...
4      prices update in  usd  1 hour     btc     37...
5    hey  the cscalp team recorded a video on how t...
6    bitcoin can breakout any second     technicala...
7     for cheap network fees coming next month  in ...
8    hey guys  i want buy a gaming pc  donate me pl...
9    surge in bitcoin energy consumption sparks deb...
Name: text, dtype: object
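
The cleaned examples above still contain runs of spaces where punctuation and symbols were replaced. If a later step needs single-spaced text, the whitespace could be collapsed as in the sketch below. `collapse_whitespace` is a hypothetical helper, not part of the original notebook, and the example string is adapted from the output above.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace with one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


example = "  prices update in  usdt  1 hour     btc  "
print(collapse_whitespace(example))  # prices update in usdt 1 hour btc
```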

### End

The data is saved locally after the visualization and cleaning steps. The next steps start from the CSV written above, 'tweets_clean_v2.csv'.