#### The following notebook includes:
1. Initial analysis of the data
2. Sampling for computation ease
3. Regular Expression Transformation

## Libraries

In [1]:
import re
import pandas as pd
from tqdm import tqdm

File names:

In [2]:
directory = '~/PycharmProjects/tfm_hugopobil'
tweets_fn = f'{directory}/data/bitcoin_tweets.csv'
bitcoin_price_fn = f'{directory}/data/BTC-USD.csv'

## Visualization

We import the dataset of tweets downloaded from the Internet. The first row of the file is dated 10 February 2021 and the last row 2 March 2022. The file is not sorted by date, so some earlier tweets (for example from 5 February 2021) also appear once the data is sorted.

To keep the computation manageable we reduce the dataset to roughly 1% of the total, that is, about 23,000 of the 2,347,470 tweets. All later analysis and prediction use this sample.

In [20]:
# Avoid re-running this cell: it loads all ~2.3 million tweets into memory
tweets = pd.read_csv(tweets_fn, low_memory=False)
print('Shape :', tweets.shape)
print('Tweets DataFrame Initial Date :', tweets.date[0])
print('Tweets DataFrame Final Date :', tweets.date.iloc[-1])
tweets.head()

Shape : (2347470, 13)
Tweets DataFrame Initial Date : 2021-02-10 23:59:04
Tweets DataFrame Final Date : 2022-03-02 16:38:11


Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,is_retweet
0,DeSota Wilson,"Atlanta, GA","Biz Consultant, real estate, fintech, startups...",2009-04-26 20:05:09,8534.0,7605,4838,False,2021-02-10 23:59:04,Blue Ridge Bank shares halted by NYSE after #b...,['bitcoin'],Twitter Web App,False
1,CryptoND,,😎 BITCOINLIVE is a Dutch platform aimed at inf...,2019-10-17 20:12:10,6769.0,1532,25483,False,2021-02-10 23:58:48,"😎 Today, that's this #Thursday, we will do a ""...","['Thursday', 'Btc', 'wallet', 'security']",Twitter for Android,False
2,Tdlmatias,"London, England","IM Academy : The best #forex, #SelfEducation, ...",2014-11-10 10:50:37,128.0,332,924,False,2021-02-10 23:54:48,"Guys evening, I have read this article about B...",,Twitter Web App,False
3,Crypto is the future,,I will post a lot of buying signals for BTC tr...,2019-09-28 16:48:12,625.0,129,14,False,2021-02-10 23:54:33,$BTC A big chance in a billion! Price: \487264...,"['Bitcoin', 'FX', 'BTC', 'crypto']",dlvr.it,False
4,Alex Kirchmaier 🇦🇹🇸🇪 #FactsSuperspreader,Europa,Co-founder @RENJERJerky | Forbes 30Under30 | I...,2016-02-03 13:15:55,1249.0,1472,10482,False,2021-02-10 23:54:06,This network is secured by 9 508 nodes as of t...,['BTC'],Twitter Web App,False
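Because loading the full file is expensive, a lighter way to inspect or subsample it can help. The sketch below is an illustration on a small in-memory CSV, not the notebook's method: `csv_text` and its contents are hypothetical, and the chunk size and sampling fraction are arbitrary choices. It uses the `usecols`, `nrows` and `chunksize` parameters of `pd.read_csv`.

```python
import io
import pandas as pd

# Toy CSV standing in for bitcoin_tweets.csv (hypothetical contents).
csv_text = "date,text\n" + "\n".join(
    f"2021-02-{10 + i % 5:02d} 12:00:00,tweet {i}" for i in range(1000)
)

# Option 1: read only the first rows and only the columns that are needed.
preview = pd.read_csv(io.StringIO(csv_text), usecols=["date", "text"], nrows=100)

# Option 2: stream the file in chunks and keep a 1% random sample of each chunk,
# so the full dataset never has to sit in memory at once.
pieces = [
    chunk.sample(frac=0.01, random_state=0)
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=200)
]
sampled = pd.concat(pieces, ignore_index=True)

print(preview.shape, sampled.shape)
```

Chunked sampling is a plain random sample, so unlike per-day sampling it does not guarantee that every date is represented.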


### Stratified sampling:

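Why? Drawing the same number of tweets from every day keeps each date in the range represented in the sample. A plain random sample of the whole file would instead over-represent days with heavy tweet volume and could leave quiet days with very few tweets.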

In [28]:
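print(type(tweets.date))
# Sort chronologically (the dates are still strings at this point)
tweets = tweets.sort_values(by='date')
# Drop the last 12 rows after sorting (reason not recorded here; presumably malformed records)
tweets = tweets.iloc[:-12]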

<class 'pandas.core.series.Series'>
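The cell above drops a fixed number of trailing rows without stating why. If the intent is to discard rows whose `date` value is not a real timestamp (an assumption, since the notebook does not say), the filter can be made explicit. A minimal sketch on hypothetical toy data, using `pd.to_datetime` with `errors="coerce"`:

```python
import pandas as pd

# Toy frame; the last row mimics a malformed record (hypothetical data).
toy = pd.DataFrame({
    "date": ["2021-02-10 23:59:04", "2021-02-05 16:34:31", "['bitcoin']"],
    "text": ["a", "b", "c"],
})

# Values that cannot be parsed become NaT instead of raising an error.
toy["parsed"] = pd.to_datetime(toy["date"], errors="coerce")

# Keep only rows with a valid timestamp, then sort chronologically.
valid = (
    toy.dropna(subset=["parsed"])
    .sort_values("parsed")
    .reset_index(drop=True)
)
print(valid)
```

This avoids hard-coding the number of rows to drop, which would silently become wrong if the input file changed.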


In [29]:
# Keep only the calendar day (the part before the first space) and parse it as a datetime
tweets['sample_date'] = pd.to_datetime(tweets['date'].str.split(' ', n=1).str[0])

In [37]:
# Draw 200 tweets at random from each day (raises an error if a day has fewer than 200)
tweets_sample = tweets.groupby('sample_date', group_keys=False).apply(lambda x: x.sample(200))
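The per-day sampling step is not reproducible as written, and it fails on any day with fewer than 200 tweets. The following sketch shows one way to address both points on hypothetical toy data: the name `PER_DAY`, the seed value and the toy frame are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Toy data: three days with 5, 3 and 1 tweets (hypothetical values).
toy = pd.DataFrame({
    "sample_date": pd.to_datetime(
        ["2021-02-05"] * 5 + ["2021-02-06"] * 3 + ["2021-02-07"]
    ),
    "text": [f"tweet {i}" for i in range(9)],
})

PER_DAY = 2  # the notebook uses 200

# Sample each day separately; cap n at the group size so sparse days
# do not raise, and fix the seed so the sample can be regenerated.
parts = [
    group.sample(n=min(len(group), PER_DAY), random_state=42)
    for _, group in toy.groupby("sample_date")
]
toy_sample = pd.concat(parts)

print(toy_sample["sample_date"].value_counts().sort_index())
```

Days with fewer rows than the cap contribute everything they have, so the sample is only approximately balanced in that case.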

In [38]:
tweets_sample.head()

Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,is_retweet,sample_date
20754,LiveFreeDieHappy,,#Bitcoin,2011-08-28 13:03:36,26.0,159,592,False,2021-02-05 16:34:31,Did you know that #Bitcoin has a private and p...,"['Bitcoin', 'BTC']",Twitter for iPhone,False,2021-02-05
20387,Praisewizzy,,Don't Give A Fuck!!,2020-02-10 11:28:33,272.0,490,187,False,2021-02-05 18:58:55,"This one wey piggy vest dey trend, Wetin dey o...","['PiggyVest', 'Bitcoin', 'cryptocurrency', 'BT...",Twitter for Android,False,2021-02-05
20014,Fracking Miner,,Plug & play cryptocurrency mining platform and...,2018-10-25 14:59:32,37.0,9,19,False,2021-02-05 21:59:03,"The CEO of CryptoQuant says a recent 15,200 BT...",,CoSchedule,False,2021-02-05
20129,theCryptoJourney,Netherlands,Follow me on my journey! Starting with $2000 a...,2021-01-29 09:51:50,59.0,226,177,False,2021-02-05 21:02:42,#mycryptojourney Day 8: a late buy on #binance...,"['mycryptojourney', 'binance', 'VIB', 'crypto'...",Twitter Web App,False,2021-02-05
19888,Crypto News Analyst,"Atlanta, GA",#Bitcoin We aim to bring you the latest Headl...,2021-01-15 19:34:43,36.0,23,432,False,2021-02-05 23:13:08,🚀It feels like there is something new brewing ...,"['DeFi', 'Ethereum', 'DeFi']",Twitter for Android,False,2021-02-05


## Data Cleaning

The first step when dealing with tweets is to clean the tweet text. Working on the roughly 1% sample drawn above, we apply basic regular-expression cleaning (removing hash signs, URLs and @mentions) and save the result to CSV.

In [40]:
# clean tweets
tweets_sample = tweets_sample.sort_values(by = 'date')

# Copy the ~1% sample and reset its index so row labels run from 0 to n-1
dd = tweets_sample.copy()
dd = dd.reset_index()

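for i, s in enumerate(tqdm(dd['text'], position=0, leave=True)):
    text = str(s)
    # Remove the hash sign but keep the hashtag word
    text = text.replace('#', '')
    # Remove URLs; this pattern matches the scheme and host only, so any path after the host stays in the text
    text = re.sub(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', '', text, flags=re.MULTILINE)
    # Remove @mentions and any spaces that follow them
    text = re.sub(r'@\w+ *', '', text, flags=re.MULTILINE)
    dd.loc[i, 'text'] = text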

dd.to_csv(f'{directory}/data/sampled_data/tweets_clean_v2.csv', header=True, encoding='utf-8', index=False)

100%|██████████| 23200/23200 [00:01<00:00, 20041.90it/s]
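The same three cleaning steps can also be written with pandas string methods instead of a row-by-row loop. The sketch below runs on hypothetical toy tweets. Note one deliberate substitution: the URL pattern `https?://\S+` is not the notebook's pattern. It removes the whole URL up to the next whitespace, whereas the pattern in the cell above stops at the first `/` after the host.

```python
import pandas as pd

# Hypothetical example tweets.
toy = pd.DataFrame({"text": [
    "Check #Bitcoin at https://t.co/abc123 now",
    "@alice @bob $BTC is up",
]})

cleaned = (
    toy["text"].astype(str)
    .str.replace("#", "", regex=False)             # drop the hash sign, keep the tag word
    .str.replace(r"https?://\S+", "", regex=True)  # drop the whole URL, path included
    .str.replace(r"@\w+ *", "", regex=True)        # drop mentions and trailing spaces
)

print(cleaned.tolist())
```

Removing a URL from the middle of a sentence leaves two adjacent spaces behind; a final `.str.replace(r"\s+", " ", regex=True).str.strip()` would normalise them if that matters downstream.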


### End

The sampled and cleaned data is saved locally as a CSV file. The next steps start from that file, 'tweets_clean_v2.csv'.