# Tweets Feature Engineering


<div style="text-align:center">
<img src="https://i.gifer.com/YHlm.gif" width='350'>
</div>

### <span style="color:#ff5f27;"> 📝 Imports</span>

In [1]:
import json
import io
import re
import time
import os.path
import math

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


from datetime import timedelta, datetime
from dateutil import parser

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from tqdm import tnrange, tqdm_notebook, tqdm

from sklearn import preprocessing

import warnings

warnings.filterwarnings('ignore')

## <span style='color:#ff5f27'> 🪛 Preprocessing tweets from [Kaggle](https://www.kaggle.com/datasets/kaushiksuresh147/bitcoin-tweets) </span>

### I took this dataset from Kaggle because the Twitter API only lets us fetch tweets from roughly the last 9 days, while this dataset holds about 3.5 million of them. In this notebook and the following ones we process this large dataset; the Streamlit app then only fetches newly created tweets and adds them to our Feature Groups.

## ‼️ Since [this dataset](https://www.kaggle.com/datasets/kaushiksuresh147/bitcoin-tweets) is about 600 MB, it is not included in this GitHub repository. Download it manually from Kaggle and place the file in this folder.

In [3]:
df_tweets = pd.read_csv("Bitcoin_tweets.csv")

In [4]:
df_tweets.shape

(3543853, 13)

In [5]:
df_tweets.columns

Index(['user_name', 'user_location', 'user_description', 'user_created',
       'user_followers', 'user_friends', 'user_favourites', 'user_verified',
       'date', 'text', 'hashtags', 'source', 'is_retweet'],
      dtype='object')

1. Drop all rows where `is_retweet` is True.
2. Keep only `text`, `date` and a new quality feature (likes * followers, etc.).
3. Sort out the rest.
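The first two steps of this plan can be sketched on a toy frame. This is only an illustration under stated assumptions: the rows are made up, the `quality` column name is hypothetical, and multiplying `user_followers` by `user_favourites` is one possible reading of "likes * followers", not a formula defined in this notebook.

```python
import pandas as pd

# Toy rows that reuse the dataset's column names; the values are made up.
toy = pd.DataFrame({
    "date": ["2021-02-05 10:52:04", "2021-02-05 10:52:06", "2021-02-05 10:52:07"],
    "text": ["first tweet", "a retweet", "third tweet"],
    "user_followers": [301.0, 37.0, 10.0],
    "user_favourites": [361, 410, 5],
    "is_retweet": [False, True, False],
})

# Step 1: drop retweets.
originals = toy[~toy["is_retweet"]].copy()

# Step 2: hypothetical quality feature (assumption: followers * favourites).
originals["quality"] = originals["user_followers"] * originals["user_favourites"]

# Keep only the columns named in the plan.
result = originals[["date", "text", "quality"]].reset_index(drop=True)
print(result)
```

On the real data, `is_retweet` may need to be converted to booleans first, since misaligned rows can leave other values in that column.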

In [6]:
df_tweets.user_verified = df_tweets.user_verified.astype(str)

In [7]:
# Rows whose `user_verified` is a valid boolean string are treated as correctly aligned.
df_tweets_correct = df_tweets[(df_tweets["user_verified"] == "False") | (df_tweets["user_verified"] == "True")]

In [8]:
df_tweets_correct

Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,is_retweet
0,DeSota Wilson,"Atlanta, GA","Biz Consultant, real estate, fintech, startups...",2009-04-26 20:05:09,8534.0,7605,4838,False,2021-02-10 23:59:04,Blue Ridge Bank shares halted by NYSE after #b...,['bitcoin'],Twitter Web App,False
1,CryptoND,,😎 BITCOINLIVE is a Dutch platform aimed at inf...,2019-10-17 20:12:10,6769.0,1532,25483,False,2021-02-10 23:58:48,"😎 Today, that's this #Thursday, we will do a ""...","['Thursday', 'Btc', 'wallet', 'security']",Twitter for Android,False
2,Tdlmatias,"London, England","IM Academy : The best #forex, #SelfEducation, ...",2014-11-10 10:50:37,128.0,332,924,False,2021-02-10 23:54:48,"Guys evening, I have read this article about B...",,Twitter Web App,False
3,Crypto is the future,,I will post a lot of buying signals for BTC tr...,2019-09-28 16:48:12,625.0,129,14,False,2021-02-10 23:54:33,$BTC A big chance in a billion! Price: \487264...,"['Bitcoin', 'FX', 'BTC', 'crypto']",dlvr.it,False
4,Alex Kirchmaier 🇦🇹🇸🇪 #FactsSuperspreader,Europa,Co-founder @RENJERJerky | Forbes 30Under30 | I...,2016-02-03 13:15:55,1249.0,1472,10482,False,2021-02-10 23:54:06,This network is secured by 9 508 nodes as of t...,['BTC'],Twitter Web App,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3543848,🔔system'cRe5520',AWS eu-west-1a Ireland Region,The channel breakout trading strategy bot for ...,2012-06-02 07:46:44,236.0,16.0,63.0,False,2022-05-30 00:00:03,strategy: 5010HL1h atr20d: 1790.98\n\n🕛30 May ...,"['BTC', 'BitMEX']",system'cRe5520',False
3543849,Bitcoin Price Tracker,,Get frequent updates on Bitcoin Price.,2022-04-01 18:46:56,628.0,2.0,1.0,False,2022-05-30 00:00:02,"$BTC Price: $29,465 \n#Bitcoin #BTC #BitcoinPr...","['Bitcoin', 'BTC', 'BitcoinPrice', 'Crypto']",bitcoin_price_bot,False
3543850,Astro Bot,,Fully Automated Trading Bot. \nCreator: @Astro...,2020-11-07 04:35:37,68.0,1.0,39.0,False,2022-05-30 00:00:02,ASTRO BOT ALERT\n\n $BTCUSDT\n\n Timeframe:2h\...,"['AstroBotSignals', 'BTC', 'Cryptocurrency', '...",The_AstroBot,False
3543851,Job Preference,,You Don't Need a middleman to find a decent #J...,2020-11-28 15:56:01,1486.0,590.0,12165.0,False,2022-05-30 00:00:01,#Hiring?\nSign up now https://t.co/o7lVlsCHXv\...,"['Hiring', 'Jobs', 'Java', 'Programming', 'Cod...",Twitter Web App,False


In [9]:
df_tweets_incorrect = df_tweets.drop(list(df_tweets_correct.index))

In [10]:
# Move every value in the misaligned rows two columns to the right to realign them.
df_tweets_corrected = df_tweets_incorrect.shift(periods=2, axis="columns")

In [11]:
df_tweets_processed = pd.concat([df_tweets_correct, df_tweets_corrected]).sort_values(by=["date"])
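The repair in the cells above (flag rows by `user_verified`, shift the rest two columns to the right, concatenate) can be reproduced on a small frame. This is a minimal sketch with made-up rows and a reduced column set; it assumes, as the notebook does, that a misaligned row has its values two columns too far to the left.

```python
import pandas as pd

cols = ["user_name", "user_location", "user_verified", "date"]
toy = pd.DataFrame(
    [
        ["alice", "Berlin", "False", "2021-02-05 10:52:04"],
        # Misaligned row: the values sit two columns too far to the left.
        ["True", "2021-02-06 08:00:00", None, None],
    ],
    columns=cols,
    dtype=object,
)

# A row counts as aligned when `user_verified` holds "True" or "False".
valid = toy["user_verified"].isin(["True", "False"])
correct = toy[valid]

# Shift the remaining rows two columns to the right; the vacated cells become missing.
repaired = toy[~valid].shift(periods=2, axis="columns")

fixed = pd.concat([correct, repaired]).sort_values(by="date")
print(fixed)
```

The shift only helps rows that are off by exactly two columns; any other kind of damage still has to be handled separately.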

In [12]:
del df_tweets_correct
del df_tweets_corrected

In [13]:
df_tweets_processed.shape

(3543853, 13)

In [14]:
df_tweets_processed.head(5)

Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,is_retweet
2612133,,,Electronics: Find tech deals for smartphones,watches,headphones,speakers,"Laptops and more.""",2021-10-09 06:21:24,1.0,35,20,False,2022-03-24 16:52:46
21524,Iconic Holding,"Frankfurt am Main, Germany",Professional Crypto Asset Ventures \nhttps://t...,2021-01-05 13:22:24,301.0,1075,361,False,2021-02-05 10:52:04,📖 Weekend Read 📖\n\nKeen to learn about #cryp...,['crypto'],Twitter Web App,False
21523,Iconic Holding,"Frankfurt am Main, Germany",Professional Crypto Asset Ventures \nhttps://t...,2021-01-05 13:22:24,301.0,1075,361,False,2021-02-05 10:52:04,2⃣ Debunking 9 #Bitcoin Myths by @Patrick_Lo...,"['Bitcoin', 'cryptocurrency', 'bitcoin', 'cryp...",Twitter Web App,False
21522,Iconic Holding,"Frankfurt am Main, Germany",Professional Crypto Asset Ventures \nhttps://t...,2021-01-05 13:22:24,301.0,1075,361,False,2021-02-05 10:52:06,4⃣ 🎙️ Bloomberg LP #CryptoOutlook 2021 with @...,"['CryptoOutlook', 'cryptocurrency', 'bitcoin',...",Twitter Web App,False
21521,Iconic Holding,"Frankfurt am Main, Germany",Professional Crypto Asset Ventures \nhttps://t...,2021-01-05 13:22:24,301.0,1075,361,False,2021-02-05 10:52:07,"5⃣ #Blockchain 50 2021 by @DelRayMan, @Forbe...","['Blockchain', 'cryptocurrency', 'bitcoin', 'c...",Twitter Web App,False


In [15]:
df_tweets_processed = df_tweets_processed[df_tweets_processed.date.notna()]

In [16]:
# Drop row 2612133 by hand: its fields are still misaligned (first row of the output above).
df_tweets_processed = df_tweets_processed.drop(2612133)

In [17]:
df_tweets_processed["source"]= df_tweets_processed["source"].apply(str)
df_tweets_processed["source"]= df_tweets_processed["source"].str.lower()
df_tweets_processed = df_tweets_processed[~df_tweets_processed["source"].str.contains("bot")]
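To see what this substring filter catches, here is a small sketch on a few `source` values that appear in the outputs above. Using `case=False` and `na=False` instead of the explicit `str`/`lower` conversion is a choice made for this sketch, not what the notebook does.

```python
import pandas as pd

# Client names taken from the rows shown earlier, plus one missing value.
sources = pd.Series(
    ["Twitter Web App", "bitcoin_price_bot", "The_AstroBot", "dlvr.it", None],
    name="source",
)

# Case-insensitive substring match; missing sources are treated as "not a bot".
is_bot = sources.str.contains("bot", case=False, na=False)
kept = sources[~is_bot]
print(kept.tolist())
```

The filter relies only on the client name, so automated accounts that post from a client without "bot" in its name are not removed.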

In [18]:
df_tweets_processed = df_tweets_processed.loc[:,["date","text", "user_followers","user_friends", "user_favourites"]]

In [19]:
df_tweets_processed.date = pd.to_datetime(df_tweets_processed.date)
df_tweets_processed["text"] = df_tweets_processed["text"].apply(str)

In [20]:
df_tweets_processed.head(5)

Unnamed: 0,date,text,user_followers,user_friends,user_favourites
21524,2021-02-05 10:52:04,📖 Weekend Read 📖\n\nKeen to learn about #cryp...,301.0,1075,361
21523,2021-02-05 10:52:04,2⃣ Debunking 9 #Bitcoin Myths by @Patrick_Lo...,301.0,1075,361
21522,2021-02-05 10:52:06,4⃣ 🎙️ Bloomberg LP #CryptoOutlook 2021 with @...,301.0,1075,361
21521,2021-02-05 10:52:07,"5⃣ #Blockchain 50 2021 by @DelRayMan, @Forbe...",301.0,1075,361
21520,2021-02-05 10:52:26,#reddcoin #rdd @reddcoin to the moon #altcoin ...,37.0,123,410


In [21]:
df_tweets_processed = df_tweets_processed.sort_values(by='date')
df_tweets_processed.reset_index(inplace=True)
df_tweets_processed.drop(columns=["index"], inplace=True)

In [7]:
df_tweets_processed = df_tweets_processed[["date", "text"]]

In [8]:
df_tweets_processed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3411665 entries, 0 to 3411664
Data columns (total 2 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   date    object
 1   text    object
dtypes: object(2)
memory usage: 52.1+ MB


## <span style='color:#ff5f27'>🧹 Text cleaning </span>

In [9]:
# Raw strings keep the regex escapes intact; the URL pattern runs to the next whitespace so the path is removed too.
for i, s in enumerate(tqdm(df_tweets_processed['text'], position=0, leave=True)):
    text = str(s).replace("#", "")              # drop the hash sign, keep the tag word
    text = re.sub(r'https?://\S+', '', text)    # strip links
    text = re.sub(r'@\w+ *', '', text)          # strip @mentions
    df_tweets_processed.loc[i, 'text'] = text
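The same three substitutions can also be packaged as a function and applied with `Series.map`, which avoids writing to the frame row by row. This is a minimal sketch: `clean_tweet`, `URL_RE` and `MENTION_RE` are hypothetical names, the sample tweet is made up, and the URL pattern assumes a link should be removed up to the next whitespace.

```python
import re

import pandas as pd

# Patterns are compiled once and reused for every tweet.
URL_RE = re.compile(r"https?://\S+")   # assumption: a link ends at the next whitespace
MENTION_RE = re.compile(r"@\w+ *")     # an @mention plus any trailing spaces


def clean_tweet(text):
    """Remove hash signs, links and @mentions from a single tweet."""
    text = str(text).replace("#", "")
    text = URL_RE.sub("", text)
    return MENTION_RE.sub("", text)


sample = pd.Series(["Check #Bitcoin at https://t.co/abc123 @elonmusk now"])
cleaned = sample.map(clean_tweet)
print(cleaned.iloc[0])
```

Removing tokens can leave doubled spaces behind; if that matters downstream, `" ".join(text.split())` collapses them.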

## <span style='color:#ff5f27'> 📥 Save the results</span>

In [10]:
df_tweets_processed.to_csv("data/tweets_processed.csv")