# Classification of 'CryptoCurrency' and 'Stocks' Subreddits

# Part 1: Data Extraction and Data Cleaning

## Problem Statement

The goal of this project is to build a binary classification model that predicts whether a post on Reddit belongs to the "CryptoCurrency" or "Stocks" subreddit. The model will be considered successful if both the F1 score and sensitivity are above 90%.
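The two success metrics can be made concrete with a short sketch. It assumes "CryptoCurrency" is treated as the positive class and uses made-up confusion-matrix counts; the function names are illustrative and not part of the project code.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Share of actual positives that the model catches: TP / (TP + FN)."""
    return tp / (tp + fn)


def precision(tp: int, fp: int) -> float:
    """Share of predicted positives that are correct: TP / (TP + FP)."""
    return tp / (tp + fp)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and sensitivity."""
    p = precision(tp, fp)
    r = sensitivity(tp, fn)
    return 2 * p * r / (p + r)


# Hypothetical counts, for illustration only (not results from this project)
tp, fp, fn = 95, 5, 5
meets_target = sensitivity(tp, fn) > 0.90 and f1_score(tp, fp, fn) > 0.90
print(sensitivity(tp, fn), f1_score(tp, fp, fn), meets_target)
```

Sensitivity is emphasised alongside F1 because a missed cryptocurrency post is a missed advertising opportunity.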

Additionally, the project aims to provide insights into the key words and phrases that people use when discussing cryptocurrency on Reddit, which could be used to develop a more effective marketing campaign for cryptocurrency.

### Contents
1. [Background](#1.-Background)
2. [Data Extraction](#2.-Data-Extraction)
3. [Data Cleaning](#3.-Data-Cleaning)
4. [NLP Pre-Processing](#4.-NLP-Pre-Processing)

## 1. Background

CryptoGo is a FinTech startup that sells cryptocurrency products. They are looking to expand their customer base by running an online marketing campaign on the social media platform, Reddit. With a limited budget, CryptoGo wants to target their advertisements to all users on Reddit who have already expressed some interest in cryptocurrency and would therefore be more likely to invest in cryptocurrency products.

CryptoGo would like to leverage Reddit's subreddit targeting model to ensure its ads reach the communities most interested in cryptocurrency. CryptoGo has already identified the "CryptoCurrency" subreddit as one of the subreddits it wants to target. The "CryptoCurrency" subreddit is an active subreddit with over 3 million members posting news and discussions on cryptocurrency.

In order to identify similar posts across Reddit, CryptoGo has tasked its Data Science team with building a classification model that can predict whether or not a post belongs to the CryptoCurrency subreddit with at least 90% sensitivity and F1 score. The team will build a binary classification model to identify whether a post belongs to the "CryptoCurrency" or "Stocks" subreddit. The "Stocks" subreddit is another active subreddit with over 2.8 million members. Because the "Stocks" subreddit is also centred on a popular financial product, training the model to distinguish "CryptoCurrency" posts from "Stocks" posts should let CryptoGo identify more relevant, cryptocurrency-related posts across Reddit than if the model were trained against a non-financial subreddit. The model can then be used to find posts elsewhere on Reddit that show a similar interest in cryptocurrency, so CryptoGo can expand its marketing to other subreddits.

Additionally, CryptoGo is looking to identify key words and phrases that people use when discussing cryptocurrency on social media, which will help them understand the market sentiment and develop their marketing campaign more effectively. 



## 2. Data Extraction

In [2]:
# import libraries
import requests
import pandas as pd
import numpy as np
import pickle

from datetime import datetime
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import re
import contractions

In [4]:
# function to pull n posts from a specific subreddit, 100 posts per request
def get_reddit(subreddit, n_samples, verbose=False):
    url = 'https://api.pushshift.io/reddit/search/submission'
    params = {'subreddit': subreddit, 'size': min(n_samples, 100)}

    res = requests.get(url, params=params)
    data = res.json()
    frames = [pd.DataFrame(data['data'])]

    count = min(n_samples, 100)
    n = n_samples - 100

    while n > 0:
        if verbose and count % 100 == 0:
            print(f'{count} rows generated')

        params = {
            'subreddit': subreddit,
            'size': min(n, 100),
            # step back 700,000 seconds (about 8 days) from the last post of the previous batch
            'before': data['data'][-1]['created_utc'] - 700000,
        }
        res = requests.get(url, params=params)
        data = res.json()
        frames.append(pd.DataFrame(data['data']))
        n -= 100
        count += 100
    # DataFrame.append was removed in pandas 2.0, so collect the batches and concatenate once
    return pd.concat(frames).reset_index(drop=True)

- The function will pull 100 posts from a given subreddit roughly every 8 days (700,000 seconds), counting back from the current date
- Therefore pulling 5000 extracts from each subreddit should give roughly a year's worth of data
- By pulling data from different weeks and different days of the week, the data will be less skewed towards current events/hot topics or repetitive weekly posts, so the model will be better at identifying future cryptocurrency posts (when the news cycle is different)
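The "roughly a year" estimate above can be checked with a little arithmetic. This is a back-of-the-envelope sketch: the constant names are illustrative, and the result is a lower bound because each batch of 100 posts also covers some time of its own.

```python
from datetime import timedelta

STEP_SECONDS = 700_000  # offset subtracted from created_utc between batches
BATCH_SIZE = 100        # posts returned per request
N_SAMPLES = 5000        # posts requested per subreddit

step = timedelta(seconds=STEP_SECONDS)
n_batches = N_SAMPLES // BATCH_SIZE

# The first batch is anchored at the current date, so only
# n_batches - 1 backward steps are taken.
min_span = step * (n_batches - 1)

print(f'step between batches: about {step.days} days')
print(f'batches: {n_batches}')
print(f'minimum span covered: about {min_span.days} days')
```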

In [17]:
# create dataframe for 5000 extracts from Cryptocurrency sub
df_crypto = get_reddit('CryptoCurrency', 5000)

# create dataframe for 5000 extracts from Stocks sub
df_stocks = get_reddit('stocks', 5000)

In [18]:
# check shape of crypto extract
df_crypto.shape

(5000, 86)

In [57]:
# check extract dates
start = int(df_crypto['created_utc'][4999])
end = int(df_crypto['created_utc'][0])
print(f'start date:{datetime.fromtimestamp(start)}')
print(f'end date:{datetime.fromtimestamp(end)}')

start date:2020-05-01 12:32:32
end date:2021-06-23 16:57:53


- Posts are from 1 May 2020 to 23 June 2021
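One caveat when reading these dates: `datetime.fromtimestamp` without a `tz` argument converts to the local time zone of the machine running the notebook, so the printed times can shift by several hours between environments. A minimal sketch of a time-zone-aware conversion follows; the helper name and the example timestamp are illustrative and not taken from the dataset.

```python
from datetime import datetime, timezone


def utc_from_epoch(ts):
    """Convert a Unix timestamp (seconds) to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(int(ts), tz=timezone.utc)


example_ts = 1609459200  # illustrative value: midnight UTC on 1 January 2021
print(utc_from_epoch(example_ts).isoformat())
```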

In [58]:
# check first 5 rows for crypto extract
df_crypto.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_text_color,author_flair_type,author_fullname,...,discussion_type,suggested_sort,crosspost_parent,crosspost_parent_list,banned_by,edited,gallery_data,is_gallery,poll_data,gilded
0,[],False,PoojaaPriyaa,,Transitioning,"[{'e': 'text', 't': ' '}]",,dark,richtext,t2_ac4s8rai,...,,,,,,,,,,
1,[],False,nbloglinks,,Warning-level3,"[{'e': 'text', 't': '1 - 2 years account age. ...",1 - 2 years account age. -15 - 35 comment karma.,dark,richtext,t2_7zsy6ax8,...,,,,,,,,,,
2,[],False,Dwez1337,,Warning-level2,"[{'e': 'text', 't': 'Tin'}]",Tin,dark,richtext,t2_1agc3zqz,...,,,,,,,,,,
3,[],False,Ok_Alternative6359,,Warning-level3,"[{'e': 'text', 't': '1 - 2 years account age. ...",1 - 2 years account age. -15 - 35 comment karma.,dark,richtext,t2_4qpi40yg,...,,,,,,,,,,
4,[],False,ShillCoinMafia,,,[],,,text,t2_cvzi3ust,...,,,,,,,,,,


In [59]:
# check shape of stocks extract
df_stocks.shape

(5000, 78)

In [61]:
# check extract dates
start = int(df_stocks['created_utc'][4999])
end = int(df_stocks['created_utc'][0])
print(f'start date:{datetime.fromtimestamp(start)}')
print(f'end date:{datetime.fromtimestamp(end)}')

start date:2020-04-26 06:37:25
end date:2021-06-23 16:43:22


- Posts are from 26 April 2020 to 23 June 2021

In [62]:
# re-check shape of stocks extract
df_stocks.shape

(5000, 78)

In [63]:
# export to csv

df_crypto.to_csv('../datasets/crypto.csv', index= False)
df_stocks.to_csv('../datasets/stocks.csv', index = False)

## 3. Data Cleaning

In [134]:
# import raw files

df_crypto = pd.read_csv('../datasets/crypto.csv')
df_stocks = pd.read_csv('../datasets/stocks.csv')

# drop all columns except 'title', 'selftext', 'subreddit' and 'upvote_ratio'

col = ['title', 'selftext','subreddit', 'upvote_ratio']
df_crypto = df_crypto[col]
df_stocks = df_stocks[col]

### 3.1 CryptoCurrency Subreddit

In [135]:
#check first 5 rows
df_crypto.head()

Unnamed: 0,title,selftext,subreddit,upvote_ratio
0,Bank of Israel to Use Ethereum Tech for Digita...,,CryptoCurrency,1.0
1,Crypto Swap Profits Mastermind | How To Earn $...,,CryptoCurrency,1.0
2,"Cardano Founder: Tether is Faith based, Violat...",,CryptoCurrency,1.0
3,Experienced Miners,Hi everyone I need advice regarding the crypto...,CryptoCurrency,1.0
4,"Diamond hands, baby.",,CryptoCurrency,1.0


In [136]:
# check null values
df_crypto.isnull().sum()

title              0
selftext        2496
subreddit          0
upvote_ratio       0
dtype: int64

- Out of 5000 rows, 2496 are null values for selftext (i.e. the post has a title but no text) => let's investigate these rows 

In [137]:
df_crypto[df_crypto['selftext'].isnull()].head(5)

Unnamed: 0,title,selftext,subreddit,upvote_ratio
0,Bank of Israel to Use Ethereum Tech for Digita...,,CryptoCurrency,1.0
1,Crypto Swap Profits Mastermind | How To Earn $...,,CryptoCurrency,1.0
2,"Cardano Founder: Tether is Faith based, Violat...",,CryptoCurrency,1.0
4,"Diamond hands, baby.",,CryptoCurrency,1.0
5,Bitcoin Price Decline 😂,,CryptoCurrency,1.0


- Text posts on Reddit require a title, but the body text is optional.
- I will replace the null values with empty strings to indicate the text is blank

In [138]:
df_crypto['selftext'] = df_crypto['selftext'].fillna('')

In [139]:
# check for null values again
df_crypto.isnull().sum()

title           0
selftext        0
subreddit       0
upvote_ratio    0
dtype: int64

In [140]:
df_crypto[df_crypto['selftext'] != ''].tail()

Unnamed: 0,title,selftext,subreddit,upvote_ratio
4984,I am writing a research paper on “how will cry...,any research publications or anything that I c...,CryptoCurrency,1.0
4985,NWC the heart of Newscrypto,Check out this amazing post and why you should...,CryptoCurrency,1.0
4990,Can someone explain the point of cryptocurrenc...,[removed],CryptoCurrency,1.0
4997,Is there any 'app' or platform that allows for...,[removed],CryptoCurrency,1.0
4999,I am sending you 1π! Pi is a new digital curre...,[removed],CryptoCurrency,1.0


- `[removed]` marks posts that have been removed by a moderator, admin, or spam filter
- `[deleted]` marks posts that users have deleted themselves

In [141]:
df_crypto[df_crypto['selftext']=='[removed]'].shape

(1580, 4)

In [142]:
df_crypto[df_crypto['selftext']=='[deleted]'].shape

(62, 4)

- Out of 5000 posts, 1580 were removed by the moderators, and 62 were deleted by the users themselves
- I will delete these rows for the purpose of this project because the goal is to identify posts that are similar to what users post in the CryptoCurrency subreddit, so I do not want to train my model on posts that moderators have already identified as irrelevant to this subreddit

In [143]:
# function to remove "removed" and "deleted" posts
def remove_del(df):
    mask = np.logical_not(df['selftext'].isin(['[removed]','[deleted]']))
    return df[mask]

In [144]:
df_crypto = remove_del(df_crypto)

In [145]:
# remove filler text posts
df_crypto = df_crypto[~df_crypto['selftext'].str.contains('filler')]

In [146]:
#check shape
df_crypto.shape

(3357, 4)

- For analysis, I want to combine the title and self text columns
- My classifications will be based on the words in both the title and text of the reddit posts

In [147]:
df_crypto['text'] = df_crypto['title'] + df_crypto['selftext']
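Concatenating the two columns directly fuses the last word of the title with the first word of the body (for example `time?just` in a post shown later in this notebook). A possible variant, sketched here on a toy dataframe with made-up rows, joins them with a space and strips the trailing space left by title-only posts. It is an alternative to consider, not the code that produced the outputs in this notebook.

```python
import pandas as pd

# Toy rows for illustration only; the second post has a title but no body
toy = pd.DataFrame({
    'title': ['Which app should I use?', 'Diamond hands, baby.'],
    'selftext': ['Just got into crypto.', ''],
})

# Join with a single space, then strip so title-only posts have no trailing space
toy['text'] = (toy['title'] + ' ' + toy['selftext']).str.strip()
print(toy['text'].tolist())
```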

In [148]:
# check for duplicates
df_crypto[df_crypto['text'].duplicated()].shape

(57, 5)

In [149]:
# drop duplicate rows
df_crypto.drop_duplicates(inplace = True)
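Note that the check above counts duplicates in the `text` column only, whereas `drop_duplicates()` with no arguments compares every column, so rows with identical text but a different `upvote_ratio` are kept. This would explain why the shapes reported in this notebook go from 3357 to 3303 rows (54 dropped) although 57 were flagged. The toy example below, with made-up values, shows the difference and how the `subset` parameter restricts the comparison to `text`.

```python
import pandas as pd

# Toy data for illustration: rows 0, 1 and 3 share the same text,
# but row 1 has a different upvote_ratio
toy = pd.DataFrame({
    'text': ['a', 'a', 'b', 'a'],
    'upvote_ratio': [1.0, 0.9, 1.0, 1.0],
})

flagged = int(toy['text'].duplicated().sum())   # duplicates judged on text alone
all_cols = toy.drop_duplicates()                # compares every column
by_text = toy.drop_duplicates(subset=['text'])  # compares text only

print(flagged, len(all_cols), len(by_text))
```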

### 3.2 Stocks Subreddit

In [150]:
#check first 5 rows
df_stocks.head()

Unnamed: 0,title,selftext,subreddit,upvote_ratio
0,With net profit compound growth of 145% from 2...,[removed],stocks,1.0
1,Interested in investing 10k into an index fund...,Looking to invest in an index fund but I’m som...,stocks,1.0
2,Looking for a Semi-Professional Trader,[removed],stocks,1.0
3,Korn Ferry fee revenue of $555.2 million in Q4...,[removed],stocks,1.0
4,Terranet will mabe be the new leader of ADAS m...,[removed],stocks,1.0


In [151]:
# check null values
df_stocks.isnull().sum()

title             0
selftext         81
subreddit         0
upvote_ratio    100
dtype: int64

- Out of 5000 posts, 81 have null values for text
- Interesting, compared to CryptoCurrency which had 2496 null values -> more people post titles without further text in the CryptoCurrency subreddit

In [152]:
df_stocks[df_stocks['selftext'].isnull()].head(5)

Unnamed: 0,title,selftext,subreddit,upvote_ratio
72,Recommend me some good books to read on invest...,,stocks,1.0
105,Any opinions on amc?,,stocks,1.0
107,Canadian Bitcoin miners massively undevalued? ...,,stocks,1.0
108,RLLCF is in the middle of a pump and dump....,,stocks,1.0
113,RLLCF is in the middle of a pump and dump....,,stocks,1.0


- I will replace the selftext with blank (same as I did for CryptoCurrency)

In [153]:
df_stocks['selftext'] = df_stocks['selftext'].fillna('')

In [154]:
df_stocks[df_stocks['selftext'] != ''].head()

Unnamed: 0,title,selftext,subreddit,upvote_ratio
0,With net profit compound growth of 145% from 2...,[removed],stocks,1.0
1,Interested in investing 10k into an index fund...,Looking to invest in an index fund but I’m som...,stocks,1.0
2,Looking for a Semi-Professional Trader,[removed],stocks,1.0
3,Korn Ferry fee revenue of $555.2 million in Q4...,[removed],stocks,1.0
4,Terranet will mabe be the new leader of ADAS m...,[removed],stocks,1.0


In [155]:
df_stocks[df_stocks['selftext']=='[removed]'].shape

(2383, 4)

In [156]:
df_stocks[df_stocks['selftext']=='[deleted]'].shape

(33, 4)

- Out of 5000 posts, 2383 were removed by moderators, and 33 were deleted by users themselves
- This is significantly more posts removed than in the CryptoCurrency subreddit => let's look at a few of these posts

In [157]:
with pd.option_context('display.max_colwidth', None):
  display(df_stocks[df_stocks['selftext'] == '[removed]'].head(5))

Unnamed: 0,title,selftext,subreddit,upvote_ratio
0,"With net profit compound growth of 145% from 2018 to 2020, Medlive passes HK Exchanges’ hearing for listing",[removed],stocks,1.0
2,Looking for a Semi-Professional Trader,[removed],stocks,1.0
3,"Korn Ferry fee revenue of $555.2 million in Q4 FY’21, an increase of 26%",[removed],stocks,1.0
4,Terranet will mabe be the new leader of ADAS market,[removed],stocks,1.0
5,Terranet will take over the ADAS market,[removed],stocks,1.0


- It seems the stocks subreddit has more posts that moderators may consider spam or inappropriate than the cryptocurrency subreddit, or the stocks subreddit has stricter community guidelines 
- Similar to the cryptocurrency subreddit, I will delete all rows where the self text has been removed by mods or deleted by the users

In [158]:
df_stocks = remove_del(df_stocks)

In [159]:
# check shape
df_stocks.shape

(2584, 4)

In [160]:
df_stocks['text'] = df_stocks['title'] + df_stocks['selftext']

In [161]:
# check for duplicates
df_stocks[df_stocks['text'].duplicated()]

Unnamed: 0,title,selftext,subreddit,upvote_ratio,text
113,RLLCF is in the middle of a pump and dump....,,stocks,1.0,RLLCF is in the middle of a pump and dump....
1326,SEC Stops Trading on 15 Stocks Including GME,,stocks,1.0,SEC Stops Trading on 15 Stocks Including GME
1343,Due Diligence &amp; Analysis on Stocks. 16 thi...,,stocks,1.0,Due Diligence &amp; Analysis on Stocks. 16 thi...
2162,Shift technologies ($sft),"If you like carvana, you’ll love shift. It pla...",stocks,1.0,"Shift technologies ($sft)If you like carvana, ..."
2196,Thinking about putting 1/2 of my portfolio int...,Is this a good idea? I mainly invest in tech a...,stocks,1.0,Thinking about putting 1/2 of my portfolio int...
2734,How can I invest in forex as an Indian and are...,I am 22 and just started working in one of the...,stocks,1.0,How can I invest in forex as an Indian and are...


In [162]:
# drop duplicates
df_stocks.drop_duplicates(inplace = True)

### 3.3 Combine dataframes 

In [163]:
df_crypto.shape

(3303, 5)

In [164]:
df_stocks.shape

(2578, 5)

- I have around 700 more posts from crypto than stocks (3303 vs 2578), since many of the posts in stocks were removed by moderators
- I want balanced classes when I perform my EDA and train my models, so I will take 2500 posts from both datasets

In [165]:
df_crypto = df_crypto.sample(n=2500, replace = False, random_state = 23).reset_index(drop=True)
df_stocks = df_stocks.sample(n=2500, replace = False, random_state = 23).reset_index(drop=True)

In [171]:
# create combined dataframe
df = pd.concat([df_crypto,df_stocks]).reset_index(drop = True)
df.shape

(5000, 5)

In [172]:
# drop the title and self text columns
df.drop(['title','selftext'],axis = 1, inplace = True)

## 4. NLP Pre-Processing

- Before I move to my EDA there are some pre-processing steps for text that may be useful
- This will help me to find insights from the text data more easily during my EDA
- These steps are:
> 1. Remove special characters
> 2. Tokenize
> 3. Remove stop words
> 4. Lemmatize

### 4.1 Remove special characters

In [173]:
# convert everything to lowercase first
df['text_adj'] = df['text'].str.lower()

In [187]:
# Let's view a long post
df[df['text_adj'].str.len()>500]['text_adj'][9]

'which app will allow me to trade without any glitches and bullshit of i need to sell or buy at any given time?just got into crypto because doge. looking to put some money in btc and what not as i may as well i best into something that isn’t s meme coin. so far the apps i’ve used are pretty fucking janky. coinbase pro will not for the life of me upload my damn ids. i had to open the normal coinbase app and upload from there. didn’t catch. then it took me to the web sight and showed my id files there after i had uploaded but the app still says i need to upload id.\n\nthen i went to voyager. transferred some funds that still say “pending” after a few days. i made some trades but can’t sell or anything because rrh trade is still pending. it’s been 5 business days. \ni’m not confident that when i need to sell the app is going to just magically work with no issues and i don’t want to be in that position. i started on robinhood and am trying to go away from that because of its history of fuc

In [188]:
# use regex to remove line breaks
df['text_adj'] = df['text_adj'].map(lambda x: re.sub('\n', ' ', x)) 

In [189]:
# confirm line breaks removed
df['text_adj'][9]

'which app will allow me to trade without any glitches and bullshit of i need to sell or buy at any given time?just got into crypto because doge. looking to put some money in btc and what not as i may as well i best into something that isn’t s meme coin. so far the apps i’ve used are pretty fucking janky. coinbase pro will not for the life of me upload my damn ids. i had to open the normal coinbase app and upload from there. didn’t catch. then it took me to the web sight and showed my id files there after i had uploaded but the app still says i need to upload id.  then i went to voyager. transferred some funds that still say “pending” after a few days. i made some trades but can’t sell or anything because rrh trade is still pending. it’s been 5 business days.  i’m not confident that when i need to sell the app is going to just magically work with no issues and i don’t want to be in that position. i started on robinhood and am trying to go away from that because of its history of fuckin

In [190]:
# create function to expand contractions (e.g. "isn't" -> "is not")
 
def expand_text(text):
    expanded_words = []
    for word in text.split():
        expanded_words.append(contractions.fix(word))
    return ' '.join(expanded_words)

In [191]:
df['text_adj'] = df['text_adj'].apply(lambda x: expand_text(x))

In [192]:
# confirm contractions have been expanded
df['text_adj'][9]

'which app will allow me to trade without any glitches and bullshit of i need to sell or buy at any given time?just got into crypto because doge. looking to put some money in btc and what not as i may as well i best into something that is not s meme coin. so far the apps I have used are pretty fucking janky. coinbase pro will not for the life of me upload my damn ids. i had to open the normal coinbase app and upload from there. did not catch. then it took me to the web sight and showed my id files there after i had uploaded but the app still says i need to upload id. then i went to voyager. transferred some funds that still say “pending” after a few days. i made some trades but cannot sell or anything because rrh trade is still pending. it is been 5 business days. I am not confident that when i need to sell the app is going to just magically work with no issues and i do not want to be in that position. i started on robinhood and am trying to go away from that because of its history of 

In [194]:
# check for URLs
df[df['text_adj'].str.contains('http')]['text_adj'][16]

'phoenix token presalephoenix token presale is now live! check it out on [https://dxsale.app/app/pages/defipresale?saleid=1320&amp;chain=bsc](https://dxsale.app/app/pages/defipresale?saleid=1320&amp;chain=bsc). devs are based! all liquidity+tokens=locked!'

In [195]:
# use regex to remove URLs
# (raw strings avoid invalid-escape warnings for \w and \/ in the patterns)
df['text_adj'] = df['text_adj'].map(lambda x: re.sub(r'(http|ftp|https):\/\/([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:\/~+#-;]*[\w@?^=%&\/~+#-])?', ' ', x)) 

# use regex to remove HTML entities (&amp, &gt, etc)
df['text_adj'] = df['text_adj'].map(lambda x: re.sub(r'-?&\w+', ' ', x)) 

In [196]:
df['text_adj'][16]

'phoenix token presalephoenix token presale is now live! check it out on [ ]( ). devs are based! all liquidity+tokens=locked!'
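To see what the two substitutions do in isolation, here is a small standalone sketch that applies the same patterns to a made-up string. The names `URL_RE`, `ENTITY_RE` and `strip_urls_and_entities` are illustrative. One detail worth noticing is that the entity pattern does not consume the trailing `;`, so a stray semicolon is left behind; the tokenizer in the next section does not match it, so it disappears there.

```python
import re

# Same patterns as in the cell above, written as raw strings
URL_RE = re.compile(
    r'(http|ftp|https):\/\/([\w_-]+(?:(?:\.[\w_-]+)+))'
    r'([\w.,@?^=%&:\/~+#-;]*[\w@?^=%&\/~+#-])?'
)
ENTITY_RE = re.compile(r'-?&\w+')


def strip_urls_and_entities(text):
    """Replace URLs and HTML entities with spaces, then collapse whitespace."""
    text = URL_RE.sub(' ', text)
    text = ENTITY_RE.sub(' ', text)
    return ' '.join(text.split())


# Made-up sample, for illustration only
sample = 'docs at https://example.com/a?b=1&amp;c=2 say profit &gt; loss'
cleaned = strip_urls_and_entities(sample)
print(cleaned)
```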

### 4.2 Tokenize

In [197]:
# tokenize for words, currencies, or percentages
tokenizer = RegexpTokenizer(r'((?:[A-Za-z]\.)+|[$¥£€%]{0,1}(?:\d{1,10}[,. ])*\d{1,10}[$¥£€%]{0,1}|\w+)')

In [198]:
df['token'] = df['text_adj'].map(lambda x: tokenizer.tokenize(x.lower()))

In [199]:
df.head()

Unnamed: 0,subreddit,upvote_ratio,text,text_adj,token
0,CryptoCurrency,1.0,Media is bullish again,media is bullish again,"[media, is, bullish, again]"
1,CryptoCurrency,1.0,Half of All Bitcoin Open Interest on CME Set t...,half of all bitcoin open interest on cme set t...,"[half, of, all, bitcoin, open, interest, on, c..."
2,CryptoCurrency,1.0,BBC release Doctor Who trading cards using NFT...,bbc release doctor who trading cards using nft...,"[bbc, release, doctor, who, trading, cards, us..."
3,CryptoCurrency,1.0,What's going on with bitcoin prehalving????,what is going on with bitcoin prehalving????,"[what, is, going, on, with, bitcoin, prehalving]"
4,CryptoCurrency,1.0,Quantum Financial Reset To A 3D/5D Hybrid Stat...,quantum financial reset to a 3d/5d hybrid stat...,"[quantum, financial, reset, to, a, 3, d, 5, d,..."


In [201]:
# check tokens for first row are correct
df['token'][0]

['media', 'is', 'bullish', 'again']
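The tokenizer pattern has three alternatives, tried in order: dotted abbreviations, numbers with optional currency or percent symbols, and plain words. The sketch below exercises the same pattern with `re.findall` as a stand-in for `RegexpTokenizer` (which, to my understanding, also uses `findall` in its default mode); the sample sentence is made up.

```python
import re

TOKEN_RE = re.compile(
    r'((?:[A-Za-z]\.)+'                                       # abbreviations such as u.s.
    r'|[$¥£€%]{0,1}(?:\d{1,10}[,. ])*\d{1,10}[$¥£€%]{0,1}'    # numbers, prices, percentages
    r'|\w+)'                                                  # ordinary words
)

# Made-up sample, for illustration only
sample = 'btc is up 5% to $30,000 in the u.s. today'
tokens = TOKEN_RE.findall(sample)
print(tokens)
```

Because the number alternative allows a space between digit groups, two adjacent numbers separated by a single space would be merged into one token.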

### 4.3 Stop word removal

In [202]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [203]:
# use a set so the membership test below is fast
stop = set(stopwords.words('english'))

In [204]:
df['token'] = df['token'].apply(lambda x: [item for item in x if item not in stop])

### 4.4 Lemmatize

- I will lemmatize my tokens to return the base forms (i.e. plural and singular words can be combined for my EDA)

In [205]:
lem = WordNetLemmatizer()

In [206]:
df['token'] = df['token'].apply(lambda x: [lem.lemmatize(i) for i in x] )

In [207]:
df.head()

Unnamed: 0,subreddit,upvote_ratio,text,text_adj,token
0,CryptoCurrency,1.0,Media is bullish again,media is bullish again,"[medium, bullish]"
1,CryptoCurrency,1.0,Half of All Bitcoin Open Interest on CME Set t...,half of all bitcoin open interest on cme set t...,"[half, bitcoin, open, interest, cme, set, expi..."
2,CryptoCurrency,1.0,BBC release Doctor Who trading cards using NFT...,bbc release doctor who trading cards using nft...,"[bbc, release, doctor, trading, card, using, n..."
3,CryptoCurrency,1.0,What's going on with bitcoin prehalving????,what is going on with bitcoin prehalving????,"[going, bitcoin, prehalving]"
4,CryptoCurrency,1.0,Quantum Financial Reset To A 3D/5D Hybrid Stat...,quantum financial reset to a 3d/5d hybrid stat...,"[quantum, financial, reset, 3, 5, hybrid, stat..."


In [208]:
# pickle final dataframe
df.to_pickle('../datasets/final')

# Proceed to Notebook 2. Exploratory Data Analysis 