### About
This notebook preprocesses the original dataset:

1. Convert the sentiment labels to binary values (`negative` or `positive`).
1. Drop rows with neutral sentiment or missing values.
1. Save the result as `train.csv` and `test.csv`.

### Source
The data comes from this [Kaggle dataset](https://www.kaggle.com/datasets/datatattle/covid-19-nlp-text-classification/data?select=Corona_NLP_train.csv).

In [1]:
import pandas as pd

# Load both splits; the train file is read with ISO-8859-1 (Latin-1) encoding
test = pd.read_csv("data/Corona_NLP_test.csv.xls")
train = pd.read_csv("data/Corona_NLP_train.csv", encoding="ISO-8859-1")

In [2]:
# Check the dimensions of both splits (rows, columns)
print(test.shape)
print(train.shape)

(3798, 6)
(41157, 6)


In [3]:
def convert_sentiment(sentiment):
    """Convert a sentiment label to 'negative' or 'positive'.

    Returns None for neutral labels and for non-string input (e.g. NaN).
    """
    binary_sentiment = None
    if isinstance(sentiment, str):
        label = sentiment.lower()
        if 'negative' in label:
            binary_sentiment = 'negative'
        elif 'positive' in label:
            binary_sentiment = 'positive'

    return binary_sentiment


test['Sentiment'] = test['Sentiment'].apply(convert_sentiment)
train['Sentiment'] = train['Sentiment'].apply(convert_sentiment)

# Sentiment should now be 'negative', 'positive', or None (neutral)
test.head()

Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment
0,1,44953,NYC,02-03-2020,TRENDING: New Yorkers encounter empty supermar...,negative
1,2,44954,"Seattle, WA",02-03-2020,When I couldn't find hand sanitizer at Fred Me...,positive
2,3,44955,,02-03-2020,Find out how you can protect yourself and love...,positive
3,4,44956,Chicagoland,02-03-2020,#Panic buying hits #NewYork City as anxious sh...,negative
4,5,44957,"Melbourne, Victoria",03-03-2020,#toiletpaper #dunnypaper #coronavirus #coronav...,
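
As a quick check of the conversion above, the following sketch applies the same logic to each label. It assumes the raw files use the five classes listed in `labels`; that list is an assumption for illustration and is not read from the data here. The function is repeated so the block runs on its own.

```python
def convert_sentiment(sentiment):
    """Same logic as the notebook cell, repeated so this block is self-contained."""
    binary_sentiment = None
    if isinstance(sentiment, str):
        label = sentiment.lower()
        if "negative" in label:
            binary_sentiment = "negative"
        elif "positive" in label:
            binary_sentiment = "positive"
    return binary_sentiment


# Assumed label set (illustrative, not read from the CSV files)
labels = ["Extremely Negative", "Negative", "Neutral", "Positive", "Extremely Positive"]
mapping = {label: convert_sentiment(label) for label in labels}

# Non-string input, such as a NaN produced by pandas, also maps to None
nan_result = convert_sentiment(float("nan"))

for label, value in mapping.items():
    print(f"{label!r} -> {value!r}")
```

Both "Extremely" variants collapse into their base class because the check is a substring match, and anything without `negative` or `positive` in it becomes `None`.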


In [4]:
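# dropna() removes every row with any missing value: the neutral rows
# (Sentiment is None) and also rows missing another field, such as Location
test = test.dropna()
train = train.dropna()

# Both datasets should shrink
print(test.shape)
print(train.shape)
test.head()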

(2467, 6)
(26395, 6)


Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment
0,1,44953,NYC,02-03-2020,TRENDING: New Yorkers encounter empty supermar...,negative
1,2,44954,"Seattle, WA",02-03-2020,When I couldn't find hand sanitizer at Fred Me...,positive
3,4,44956,Chicagoland,02-03-2020,#Panic buying hits #NewYork City as anxious sh...,negative
9,10,44962,"Dublin, Ireland",04-03-2020,Anyone been in a supermarket over the last few...,positive
10,11,44963,"Boksburg, South Africa",04-03-2020,Best quality couches at unbelievably low price...,positive
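
Row 2, which has a sentiment but no location, is gone from the output above because `dropna()` drops a row if any column is missing. The sketch below contrasts that with `dropna(subset=["Sentiment"])` on a small, made-up frame; the toy data is hypothetical and only illustrates the difference, it does not say which behavior this notebook intends.

```python
import pandas as pd

# Hypothetical toy frame: row 1 lacks Location, row 3 lacks Sentiment
df = pd.DataFrame(
    {
        "Location": ["NYC", None, "Chicagoland", "Melbourne"],
        "Sentiment": ["negative", "positive", "negative", None],
    }
)

# Drops any row with a missing value in any column (rows 1 and 3)
dropped_any = df.dropna()

# Drops only rows whose Sentiment is missing (row 3)
dropped_sentiment = df.dropna(subset=["Sentiment"])

print(dropped_any.shape)
print(dropped_sentiment.shape)
```

If only the neutral tweets should be removed, the `subset` form keeps tweets that have no location.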


In [5]:
# Save data (the DataFrame index is written as an unnamed first column)
train.to_csv('data/train.csv')
test.to_csv('data/test.csv')
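
Since `to_csv` writes the index by default, reading these files back needs a little care. The sketch below uses a hypothetical two-row frame and in-memory buffers instead of the real files to show three options: a plain `read_csv`, `read_csv` with `index_col=0`, and writing with `index=False`.

```python
import io

import pandas as pd

# Hypothetical frame with a gappy index, as left behind by dropna()
df = pd.DataFrame(
    {"OriginalTweet": ["a", "b"], "Sentiment": ["negative", "positive"]},
    index=[0, 3],
)

# Default write: the index becomes an unnamed first column
buf = io.StringIO()
df.to_csv(buf)

buf.seek(0)
default_read = pd.read_csv(buf)  # index comes back as a column named "Unnamed: 0"

buf.seek(0)
indexed_read = pd.read_csv(buf, index_col=0)  # index restored from the first column

# Alternative write: leave the index out entirely
no_index = io.StringIO()
df.to_csv(no_index, index=False)
no_index.seek(0)
clean_read = pd.read_csv(no_index)

print(list(default_read.columns))
print(list(indexed_read.index))
print(list(clean_read.columns))
```

Which option fits depends on whether later steps need the original row numbers.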