## Data Cleaning

Two main issues were observed in the collected data. First, the majority of tweets include pictures or URL links. Second, the data is imbalanced: most of it is classified as negative. The process below addresses both issues and prepares the data for modeling. Because this step uses the raw data, printed output of the data frame was removed to protect users' privacy.

In [1]:
#Imported Libraries
import re
import pickle 
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
from sklearn.utils import resample
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

### Pre-processing and Feature Engineering: 

The code below first loads the scraped data into a data frame and fills any null values in the label column. For feature engineering, a `website_link` column flags whether a tweet contains a website link: positive (1) if 'http' is found in the tweet, and negative (0) if not. The tweet column is then passed through a cleaning function that removes URLs, hashtags, @usernames, "pic.twitter..." links, and non-English characters.

**Note:** During the labeling process, tweets that are classified as negative were left blank. Therefore, all blanks were filled with 0.

In [2]:
#Read data
train = pd.read_csv('../data/raw_training.csv', index_col=0).drop(columns=['user', 'date'])

#Fill all missing values in the label column with 0
train['label'] = train['label'].fillna(0)

#Label tweet with a link (has 'http' in the content) as 1 
train['website_link'] = train['tweet'].astype('str').map(lambda x: 1 if 'http' in x else 0)

#Function that removes URLs, hashtags, @usernames, non-English characters and "pic.twitter..." links from tweets.
def cleaning(word):
    '''Replace URLs, hashtags, @usernames, non-English characters and pic.twitter links with an empty string.
    Returns the cleaned tweet.'''
    word = word.lower()
    word = re.sub(r'http\S+|@\w+|#\w+', "", word) #Removes URLs, @usernames, and hashtags (each up to the next whitespace)
    word = re.sub(r'pic\.twitter\.com/\w+', "", word) #Removes pic.twitter.com links
    word = re.sub(r'[^a-z\s]', "", word) #Filters out non-English characters (text is already lowercase)
    return word

# Apply function to clean out tweets 
train['tweet'] = train['tweet'].astype('str').map(cleaning)
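
To make the effect of the cleaning step concrete, here is a minimal standalone sketch applied to a single made-up tweet. The function name `clean_tweet`, the sample text, and the simplified patterns are assumptions for illustration and are not necessarily identical to the patterns used in the notebook cell above; the final whitespace collapse is also an addition.

```python
import re

def clean_tweet(text):
    """Lowercase a tweet and strip URLs, pic.twitter links, @mentions, hashtags, and non-letters."""
    text = text.lower()
    # Remove pic.twitter.com links
    text = re.sub(r"pic\.twitter\.com/\w+", "", text)
    # Remove URLs, @mentions, and hashtags
    text = re.sub(r"http\S+|@\w+|#\w+", "", text)
    # Keep only lowercase ASCII letters and whitespace
    text = re.sub(r"[^a-z\s]", "", text)
    # Collapse the gaps left behind by the removals (illustrative extra step)
    return re.sub(r"\s+", " ", text).strip()

# Invented example tweet, not taken from the dataset
sample = "Fire near @CityHall! Details: https://example.com/a1 #wildfire pic.twitter.com/AbC123"
print(clean_tweet(sample))  # fire near details
```

Only the plain words survive, which is what the vectorizer later sees.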

### Addressing Data Imbalance
The data is severely imbalanced, with 94% classified as negative (0) and only 6% as positive (1). To resolve this, the positive (minority) class will be upsampled with replacement, while the negative (majority) class will be downsampled.

In [3]:
train['label'].value_counts(normalize=True)

0.0    0.943392
1.0    0.056608
Name: label, dtype: float64

A total of 550 rows will be sampled from the dataset: 300 from the positive class and 250 from the negative class. Since the original positive class has only 286 rows, it is upsampled with replacement to reach 300, so some positive tweets appear more than once. **The defined baseline score is 54%**, which corresponds to the share of the larger class after resampling (300 of 550).

Code Resource: https://elitedatascience.com/imbalanced-classes

In [5]:
#Upsampling and Downsampling data 
majority = train.loc[train['label'] == 0]
minority = train.loc[train['label'] == 1]

majority_downsample = resample(majority,
                              replace=False,
                              n_samples=250,
                              random_state=42)

minority_upsample = resample(minority,
                            replace=True,
                            n_samples=300,
                            random_state=42)

#Concatenate minority_upsample and majority_downsample and overwrite train with new data. 
train = pd.concat([majority_downsample, minority_upsample])
print(train['label'].value_counts(normalize=True))

#Save new dataset for BERT modeling
train.to_csv('../data/train_clean.csv')

1.0    0.545455
0.0    0.454545
Name: label, dtype: float64
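
The resampling logic and the resulting baseline can be checked on a small synthetic frame. This is a sketch under stated assumptions: the toy data, the sample sizes (5 and 6), and the variable names are invented, chosen only so that the class ratio matches the 250:300 split used above.

```python
import pandas as pd
from sklearn.utils import resample

# Toy frame: 20 negatives (0) and 4 positives (1); all values are invented
toy = pd.DataFrame({"tweet": [f"tweet {i}" for i in range(24)],
                    "label": [0] * 20 + [1] * 4})

majority = toy[toy["label"] == 0]
minority = toy[toy["label"] == 1]

# Downsample the majority without replacement: every kept row is unique
majority_down = resample(majority, replace=False, n_samples=5, random_state=42)
# Upsample the minority with replacement: 6 draws from 4 rows must repeat some rows
minority_up = resample(minority, replace=True, n_samples=6, random_state=42)

balanced = pd.concat([majority_down, minority_up])

# Baseline = share of the most frequent class after resampling (6 / 11 here)
baseline = balanced["label"].value_counts(normalize=True).max()
print(round(baseline, 3))                     # 0.545
print(minority_up.index.duplicated().any())   # True
```

A model that always predicts the larger class would score this baseline, so any trained model should be compared against it.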


### Tokenizing Data and Feature Selection:
The cleaned, resampled data is tokenized with a TfidfVectorizer, which removes stopwords and splits the text into one- to four-word combinations (n-grams), limited to 100 features. The tokenized data is then fitted with a Logistic Regression model to obtain each token's coefficient. Features whose coefficient has an absolute value greater than 0.05 are selected. The selected features and the vectorizer are pickled for use in the modeling and prediction steps.

In [10]:
#Tokenize the tweet column
tf = TfidfVectorizer(max_features=100, stop_words='english', ngram_range=(1,4))
Z = tf.fit_transform(train['tweet'])
y = train['label']

#Fit a Logistic Regression model on the tokenized tweets and save the coefficients into a data frame
#Note: get_feature_names_out() requires scikit-learn >= 1.0; older releases use get_feature_names()
lr_1 = LogisticRegression()
lr_1.fit(Z, y)
coef_df = pd.DataFrame(lr_1.coef_, columns=tf.get_feature_names_out()).T

#Select only features that have an absolute coefficient > 0.05
feature_list = list(coef_df.loc[np.abs(coef_df[0]) > 0.05].sort_values(0, ascending=False).index)

#Store tokenized data in a data frame and combine it with the cleaned dataset
tokens = pd.DataFrame(Z.toarray(), columns=tf.get_feature_names_out(), index=train.index)
train = pd.concat([train, tokens], axis=1)

#Save clean tokenize data in csv for modeling 
train.to_csv('../data/train_clean_token.csv')

#Pickle feature list and vectorizer to be used for modeling
pd.to_pickle(feature_list, 'pickle/feature.pkl')
pd.to_pickle(tf, 'pickle/vectorizer.pkl')

**Note:** Because BERT requires a large amount of memory, it is fine-tuned in Google Colab. The data cleaned in this notebook is imported into Google Colab for fine-tuning and modeling BERT.