# Data Exploration

The goal is to analyze the given dataset of crypto-related Reddit comments and look for any quirks and notable characteristics.

In [1]:
import re

import emoji
import pandas as pd

# Show full comment text instead of truncating wide columns
pd.set_option('display.max_colwidth', None)

In [2]:
crypto_sentiment_dataset_path = '../datasets/crypto_reddit_sentiment.csv'
df = pd.read_csv(crypto_sentiment_dataset_path)

In [11]:
df.head()

Unnamed: 0,worker_id,is_reviewed,review_score,Comment Text,Sentiment,Reddit URL
0,XYNN2Y4VCF3G,False,,"I bought 2200 at the ico, at 0.50$ per coin. Hold everything and sold it 3 months ago and it helped me to buy a bigger house.",Positive,https://www.reddit.com/r/Avax/comments/uzggar/comment/iabc390/?utm_source=reddit&utm_medium=web2x&context=3
1,DR6XNZMT9KRH,False,,"Harmony one , algorand , Cardano, solana , vechain gonna fly if there is ever a next bull market\n\nOtherwise just buy and stack satoshis",Positive,https://www.reddit.com/r/CryptoCurrency/comments/v09a1p/comment/iag2c78/?utm_source=reddit&utm_medium=web2x&context=3
2,9FCQGMYD4A42,False,,"Honestly, after reading this post and many of the responses, I have to conclude most of the crypto-space is totally fucked.\n\nThe consept of crypto has been entirely lost, waves of noobs arrive on crypto island, and instead of revelling in the freedom, do everything they can to plan their way to get back off of the island.",Negative,https://www.reddit.com/r/CryptoCurrency/comments/uijvn6/its_annoying_to_see_bitcoin_follows_the_stock/
3,QEZAEMV2WF9D,False,,In bear market is where money is made. I Will continue to DCA to the assets i believe.,Positive,https://www.reddit.com/r/CryptoCurrency/comments/uzwmf6/comment/iacwnpv/?utm_source=reddit&utm_medium=web2x&context=3
4,Z7J7W3XCP4XC,False,,"Funny how people think Bitcoin's risk is comparable to stocks. A lot of these crypto ""investors"" are gonna learn the hard way sooner or later.",Negative,https://www.reddit.com/r/investing/comments/um4inc/bitcoin_tumbles_more_than_50_below_its_alltime/


### How many comments are in the dataset?

In [5]:
len(df)

562

### Are there duplicates?

In [7]:
df['Comment Text'].value_counts()[:5]

There has been three fraudulent scam coins from Do Kwon and the Terra team..\n\nBasic Coin (look it up).. his first failed stablecoin where people lost everything\n\nLuna v. 1\n\nUST\n\nand now the Luna v. 2 scam\n\nThis guy is going to fucking ruin crypto for everyone. This is one case where the scammers (Do and Terra) need to be held accountable and made an example out of hopefully.    2
good time to accumulate                                                                                                                                                                                                                                                                                                                                                                                2
It's going on sale! I'm gonna get so fat on bitcoin that when I sell it people are going to call me a whale!                                                                                                          

### There are duplicates in the comment text, so let's remove them and calculate dataset length again.

In [9]:
df = df.drop_duplicates(subset='Comment Text')

In [10]:
len(df)

552
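`drop_duplicates` only removes exact string matches. As a minimal sketch, assuming we also want to treat comments that differ only in letter case or whitespace as duplicates, the text could be normalized first. The helper names `normalize_comment` and `drop_near_duplicates` are hypothetical, and the sample list is toy data rather than rows from the dataset.

```python
import re
from typing import List


def normalize_comment(text: str) -> str:
    """Collapse every whitespace run into one space, trim, and lowercase."""
    return re.sub(r'\s+', ' ', text).strip().lower()


def drop_near_duplicates(comments: List[str]) -> List[str]:
    """Keep the first occurrence of each comment, comparing normalized forms."""
    seen = set()
    unique = []
    for comment in comments:
        key = normalize_comment(comment)
        if key not in seen:
            seen.add(key)
            unique.append(comment)
    return unique


# Toy input: the second entry differs from the first only in case and whitespace
sample = ['good time to accumulate', 'Good time  to accumulate\n', 'Doge to MARS']
print(drop_near_duplicates(sample))
```

In the notebook itself, the same idea could be applied by storing `df['Comment Text'].map(normalize_comment)` in a helper column and passing that column name to `drop_duplicates(subset=...)`.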

### What subreddits are the comments from?

In [12]:
def extract_reddit_community(url):
    # Raw string avoids invalid-escape warnings; the dot is escaped so it matches literally
    community_search = re.search(r'reddit\.com/r/(\w+)/', url, re.IGNORECASE)
    return community_search.group(1)

In [13]:
df['reddit_community'] = df['Reddit URL'].apply(extract_reddit_community)

In [23]:
df['reddit_community'].value_counts(normalize=True)[:10]

CryptoCurrency       0.538043
Bitcoin              0.128623
terraluna            0.070652
ethereum             0.030797
dogecoin             0.030797
investing            0.027174
UKPersonalFinance    0.023551
eupersonalfinance    0.016304
SHIBArmy             0.016304
CanadianInvestor     0.014493
Name: reddit_community, dtype: float64
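`extract_reddit_community` raises an `AttributeError` when a URL does not match the pattern, because `re.search` returns `None`. Every URL in this dataset appears to match, but a more defensive variant might look like the sketch below. The name `extract_reddit_community_safe` is hypothetical, and returning `None` for unmatched or non-string values is an assumption about the desired behavior.

```python
import re
from typing import Optional

# Compiled once; the dot is escaped so it only matches a literal "."
COMMUNITY_PATTERN = re.compile(r'reddit\.com/r/(\w+)/', re.IGNORECASE)


def extract_reddit_community_safe(url) -> Optional[str]:
    """Return the subreddit name from a Reddit URL, or None if it cannot be found."""
    # Missing values read by pandas arrive as float NaN, not str
    if not isinstance(url, str):
        return None
    match = COMMUNITY_PATTERN.search(url)
    return match.group(1) if match else None


print(extract_reddit_community_safe(
    'https://www.reddit.com/r/Avax/comments/uzggar/comment/iabc390/'))
print(extract_reddit_community_safe('https://example.com/not-reddit'))
```

With this variant, rows that fail to parse show up as missing values in `reddit_community` and can be inspected instead of stopping the cell.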

### Are any of these comments reviewed?

If most of the comments had been reviewed and scored, I would manually check the quality of the non-reviewed ones, provided there were only a few of them.
If their quality turned out to be poor, we might want to exclude them from training.

In [15]:
df['is_reviewed'].value_counts()

False    552
Name: is_reviewed, dtype: int64

### What's the label distribution percentage-wise?

In [16]:
df['Sentiment'].value_counts(normalize=True)

Positive    0.536232
Negative    0.463768
Name: Sentiment, dtype: float64

### How long are the comments?

In [17]:
df['comment_char_length'] = df['Comment Text'].apply(len)

In [18]:
df['comment_char_length'].describe()

count     552.000000
mean      228.748188
std       388.573939
min         7.000000
25%        58.000000
50%       115.000000
75%       231.250000
max      4229.000000
Name: comment_char_length, dtype: float64

In [19]:
df['comment_tokens_count'] = df['Comment Text'].apply(lambda text: len(text.strip().split()))

In [20]:
df['comment_tokens_count'].describe()

count    552.000000
mean      40.822464
std       67.754690
min        2.000000
25%       10.750000
50%       21.000000
75%       41.000000
max      701.000000
Name: comment_tokens_count, dtype: float64

### Get longest comment to evaluate model's response time later

In [21]:
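# idxmax() returns an index label, so look it up with .loc; after drop_duplicates
# the index has gaps, and positional .iloc could select a different row
df.loc[df['comment_tokens_count'].idxmax(), 'Comment Text']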

'why I don\'t like crypto, ethically, financially and technologically:\n\nproof of work mining (e.g. Bitcoin) wastes ungodly amounts of electricity (until the world is 100% renewables mining displaces more useful work and thus creates emissions) and computing capacity\nproof of space is the same but for making the world a worse place by driving up storage costs for everybody\ncoin speculation is a massive ponzi scheme and no crypto fan acknowledges this\nthe e.g. bitcoin network throughput is ~dozens of qps which would merely be hilarious if it wasn\'t burning as much coal as Australia\njust on pure waste: Bitcoin production is estimated to generate between 22 and 22.9 million metric tons of carbon dioxide emissions a year, or between the levels produced by Jordan and Sri Lanka, a 2019 study in scientific journal Joule found. (source) - ie this game is an entire additional country of environmental damage for ~no useful gain\nno one has found an actual use for any of it except for onlin

### Get median length comment to evaluate model's response time later

In [22]:
# A 21-token comment sits at the 50th percentile (the median)
df[df['comment_tokens_count'] == 21]['Comment Text'].values[0]

'His first company ticket monster was a major flop, which was the critical reason why I never touched that dog shit.'
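Filtering on `== 21` works here because the median happens to be a whole number that occurs in the data. As a sketch under the assumption that we just want some comment nearest the median length, the selection could avoid the hard-coded value. `token_count` and `closest_to_median` are hypothetical helpers, and the sample comments are toy data.

```python
import statistics
from typing import List


def token_count(text: str) -> int:
    """Count whitespace-separated tokens, as in the notebook."""
    return len(text.split())


def closest_to_median(comments: List[str]) -> str:
    """Return the first comment whose token count is nearest the median count."""
    median = statistics.median(token_count(c) for c in comments)
    # min() keeps the earliest comment when several are equally close
    return min(comments, key=lambda c: abs(token_count(c) - median))


# Toy input with token counts 3, 4 and 8, so the median is 4
sample = [
    'Doge to MARS',
    'good time to accumulate',
    'In bear market is where money is made.',
]
print(closest_to_median(sample))
```

This also handles an even number of comments, where the median can fall between two counts and no comment matches it exactly.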

### Are there any emojis in the comments?

In [24]:
df['has_emoji'] = df['Comment Text'].apply(lambda txt: any(emoji.is_emoji(token) for token in txt.strip().split()))

In [26]:
df['has_emoji'].value_counts(normalize=True)

False    0.972826
True     0.027174
Name: has_emoji, dtype: float64

In [29]:
df[df['has_emoji']][['Comment Text', 'Sentiment']]

Unnamed: 0,Comment Text,Sentiment
19,"Literally crashing to zero in real time atm, Luna will be worth less than UST 🤣\n\n",Negative
88,"People don't want to accept that crypto is a seriously risky and speculative asset, arguably one of the most speculative because there is absolutely 0 metrics that set a ""floor"" on the price- there are no revenue, there is no EPS, there are no buildings or machines or IP that have a hard value- There is nothing to weigh and look at to say ""Holy fuck, this is very undervalued right now i should buy""....the only thing propping up crypto prices is ""feels""\n\nAnd all this money is coming out of the same bucket because crypto isn't a closed loop, meaning you don't get paid in crypto and transact in crypto directly, you have to buy it with fiat and when you buy something with it the vendor then has to exchange it back to fiat for it to have any real use for them because its not a medium of exchange excepton the very razors edge of the margin....\n\nAs the economy gets worse and prices keep going up people simply have less money, so you are going to have a lot more sellers than buyers willing to put their fiat into the riskiest of assets when they have to pay 50% more for food and gas and might lose their job soon.\n\nThis is what happens every downturn, people flee risky speculative assets 🤷‍♂️\n\nIts going to get worse, we haven't really seen any layoffs yet but it's going to happen....Crypto is really going to take a shit when that happens because, again, all the money is coming from the same bucket\n\nMy point is that Crypto markets will never ""decouple"" from the wider financial markets, that's a complete pipe dream imo",Negative
109,"I doubled down on Solana, Cardano, and Algorand 😳",Positive
138,The funny thing is that people have already added more liquidity. You can see on Algo Explorer that folks have already put some money in to your shid coin unless those are other wallets you’re associated with.\n\nWhich is to say: case in point. People will buy 💩 even if it’s labeled very clearly on the packaging and smells very much like 💩.,Negative
169,Yaaas 🙏🏼 buy more $sol,Positive
192,Doge to MARS 🚀,Positive
202,Dump it.... It's a 💩 coin,Negative
232,Unbelievable. I lost $2k to Do Kwon which was a third of my whole portfolio and for me a lot of money. 😒\n\nI’m really done with this guy. 🙄,Negative
250,"This is why I love ❤ the bitcoin community it's not just about fiat\n\nI hope I'll be around and alive one day when people will say ""how much is that in satoshis?""\n\nNot bothering in converting into state fiat.",Positive
341,LUNC all the way 🚀,Positive
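The check above splits on whitespace and tests each token, so as far as I can tell it only flags emoji that stand alone as a token; an emoji glued to a word or to punctuation would be missed. A character-level check is one way to estimate how much that matters. The sketch below uses only the standard library and a rough, non-exhaustive set of code-point ranges; both the ranges and the name `contains_emoji_char` are assumptions, not part of the `emoji` package.

```python
# Assumed, non-exhaustive code-point ranges that cover most common emoji
EMOJI_RANGES = (
    (0x1F1E6, 0x1F1FF),  # regional indicator symbols (flags)
    (0x1F300, 0x1FAFF),  # pictographs, emoticons, transport, supplemental symbols
    (0x2600, 0x27BF),    # miscellaneous symbols and dingbats
)


def contains_emoji_char(text: str) -> bool:
    """Return True if any single character falls inside one of the ranges."""
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in EMOJI_RANGES)


print(contains_emoji_char('to the moon🚀'))            # emoji attached to a word
print(contains_emoji_char('good time to accumulate'))  # plain text
```

Comparing this flag with `has_emoji` would show whether any comments contain emoji that the token-based check skipped. The ranges also include some non-emoji symbols, so treat the result as an upper-bound estimate.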


### Are there any URLs in the comments?

In [30]:
URL_PATTERN = r'(https?:\/\/(?:www\.)?[-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}[-a-zA-Z0-9()@:%_+.~#?&/=]*)'

def has_url(text):
    # True if the comment contains at least one http(s) URL
    url_search = re.search(URL_PATTERN, text, re.IGNORECASE)
    return url_search is not None

In [31]:
df['has_url'] = df['Comment Text'].apply(has_url)

In [32]:
df['has_url'].value_counts(normalize=True)

False    0.992754
True     0.007246
Name: has_url, dtype: float64

In [33]:
df[df['has_url']][['Comment Text', 'Sentiment']]

Unnamed: 0,Comment Text,Sentiment
40,"Environmental Impact: Bitcoin alone, has the environmental impact (Co2 emissions) of some small countries. For a 'currency' that is primarily used for stored value, not transactions, this is incredibly problematic. Assuming there will be no transformational breakthroughs in computing efficiency, proof of work is going to be with us, and obfuscating the long term value of crypto. - https://news.climate.columbia.edu/2021/09/20/bitcoins-impacts-on-climate-and-the-environment/\n\nComplexity: My smart, non-techy friend literally just put 40k into Titan's crypto fund because he didn't trust himself to not make a costly error in purchasing the coins I advised him on. He's in his early 30s and works for a tech company, still not confident of transacting in the coins themselves. That's going to continue to be a problem of broad adoption.\n\nScammy coins: This thread loves to shit on Elizabeth Warren for trying to regulate the crypto space, I think her heart is in the right place, but often gets the details wrong - BUUUUUT if we want broad adoption, we need zero news stories of people getting rug pulled, hacked, etc. Regulation specifically around verification of project fundamentals would be incredibly valuable to giving consumers confidence in what they're purchasing, prevent people from being rug pulled and becoming a crypto detractor.\n\nLittle/edge case consumer utility: Self explanatory. Until there are actual broad consumer market applications we're all still rolling dice in a casino.",Negative
363,"Green energy is already widely used for mining Bitcoin.\n\nhttps://finance.yahoo.com/news/green-energy-sustainable-future-bitcoin-232841036.html\n\n According to a recent report, bitcoin miners have already been using 56% of their total electricity through sustainable or renewable sources. For the members of the council, the usage is even better, 67.6%.",Positive
408,"I hate this crypto garbage\nI made a post about paying anyone 0.001btc for convincing me about cryptos usefulness. Lo an behold, I have come to realise that the ""coins"" I had for past few years are actually not there.. Imagine my surprise. I loaded $100 into my jaxx wallet a few years ago just because why not? Well thankfull it was only $100 because ofcourse I lost it. My wallet is stuck on 'initializing' whatever that means and I cant do anything: https://ibb.co/7Xv0NF1\n\nNever has a bank just lost my money like that. Crypto is the biggest fucking scam in the universe. Get out of it while you still can.\n\nUPDATE:\n\nOk so winzupdatee showed me how to fix my crap assbitcoin and as promised, I sent him 0.001btc as well as my fav answer from previous post dispite not being convinced. So to whoever said I wasnt going to pay up, I forgive you.\n\nALSO, this experience has only reinforced my distrust of crypto. I paid almost TEN DOLLARS to transfer like 40. THATS A 25% FEE LOL. How the hell can anyone advocate for a service that takes more fees than private lender banks for interest? Yea I get it, you can anonymously send money anywhere instantly, but apart from this dumb excercise on reddit, I cant fathom any reason or use for this especially not a ""store of value"". All civil and polite discussions are appreciated.",Negative
410,"People have used salt, seashells, cigarettes etc as money. Because money is what we call the good or tool that is used as ""money"".\n\nThe is used to store value. To exchange for other goods. As a unit of account.\n\nA goods or tool becomes money when it is used as money.\n\nIn a prison cigarettes becomes money when that is the goods used for money.\n\nDifferent goods or tools can be better or worse money depending on the qualities it has.\n\nThere is no requirement that it has to be tangible or physical.\n\nThe requirements are that is has to do its job more or less. It has to work as a:\n\nStore of value Means of exchange Unit of account\n\nAnd to do so it has to be fungible, transportable and durable easy to verify, etc.\n\nBitcoin has all those properties and is designed to excel at all of them. Some say it does not store value, because it is too volatile, but bitcoin does not contain volatility, the volatility is caused by other humans. And volatility is not a problem as long as it has remained in value or increase in value between time A and time B the volatility in between does not matter at all. So far that requires could require that you hold it for more than 2-4 years but it will go down and become better and when it does it becomes even more valuable as more and more people see that it is here to stay and see that they can trust that it stores value and even increase in value.\n\nSo far it have and it seems like a given that it will continue to work and I would not bet against Bitcoin.\n\nGold is not bad, but today it does not work as a means of exchange for daily items and it probably can not as people will not go back from digital. Digital is superior at saving time and being verifiable.\n\nhttps://www.thoughtco.com/what-is-money-1147763\n\n",Positive
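Since only a handful of comments contain links, it can be quicker to look at the linked domains than to read the full comments. The sketch below reuses the notebook's `URL_PATTERN` with `re.findall` and takes the host part via `urllib.parse.urlparse`; the helper name `extract_url_domains` is hypothetical.

```python
import re
from urllib.parse import urlparse

# Same pattern as in the notebook; its single capture group makes
# re.findall return the matched URL strings directly
URL_PATTERN = r'(https?:\/\/(?:www\.)?[-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}[-a-zA-Z0-9()@:%_+.~#?&/=]*)'


def extract_url_domains(text: str) -> list:
    """Return the host part of every URL found in the text."""
    urls = re.findall(URL_PATTERN, text, re.IGNORECASE)
    return [urlparse(url).netloc for url in urls]


print(extract_url_domains('I cant do anything: https://ibb.co/7Xv0NF1\n\nNever has a bank'))
```

Applied to the `has_url` rows, this would give a short list of domains (news sites, image hosts) that could inform whether URLs need special handling before training.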
