# Part b1 -- Pre-Processing and Text Cleaning

Next, we do some basic pre-processing, such as converting the text to lower case and deleting URLs. We don't need to remove stopwords such as "the" and "a" here; that will be done when we pass our data through an NLTK model later.

**Load library code**

In [1]:
from os import chdir
chdir('/home/jovyan/work/Analyzing_Unstructured_Data_for_Finance/Analyzing_Unstructured_Data_for_Finance/')

from lib import *
# suppress_warnings()

In [2]:
!pip install pymongo
import pymongo

You are using pip version 8.1.2, however version 9.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [3]:
# Specify the port explicitly: the server listens on a non-default port
# (MongoDB's default is 27017) as a small security measure
cli = pymongo.MongoClient(host='52.27.11.214', port=27016)

In [4]:
completed_collection = cli.twitter_db.completed_collection
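# On PyMongo 3.6+, list_collection_names() is the replacement; collection_names() was removed in PyMongo 4.0
cli.twitter_db.collection_names()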

['completed_collection', 'task_collection']

**Clean data**

In [5]:
# Get data out of our MongoDB collection
tweets_list = list(completed_collection.find())
tweets_df = pd.DataFrame(tweets_list)
tweets_df.head(2)

| | _id | text | timestamp | username |
|---|---|---|---|---|
| 0 | 593deb7057bbd4047664223e | On this National Gun Violence Awareness Day, l... | 2017-06-02 17:35:54 | BarackObama |
| 1 | 593deb7057bbd4047664223f | Forever grateful for the service and sacrifice... | 2017-05-29 13:09:16 | BarackObama |


In [6]:
tweets_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94856 entries, 0 to 94855
Data columns (total 4 columns):
_id          94856 non-null object
text         94856 non-null object
timestamp    94856 non-null datetime64[ns]
username     94856 non-null object
dtypes: datetime64[ns](1), object(3)
memory usage: 2.9+ MB


In [7]:
# Pull the text column out as a plain Python list (instead of a Series) so we can apply functions to it
tweets_text = tweets_df['text'].tolist()

In [8]:
import re

In [9]:
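def clean_text(text):
    # Remove anything from "http" up to the next whitespace
    text = re.sub(r'(http\S+)', '', text)
    # Bare-domain URL pattern. Because this is not a raw string, '\b' is a backspace
    # character rather than a word boundary, so it matches only text that contains a backspace
    text = re.sub('(http(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)', '', text)
    # Replace every non-word character (punctuation and symbols, including "@") with a space
    text = re.sub(r'[\W]', ' ', text)
    # Intended to drop @mentions, but "@" is already gone at this point, so this is a no-op
    text = re.sub(r'(\@.*?(?=\s))', ' ', text)
    # Replace remaining non-word characters and digits with spaces
    text = re.sub(r'[\W, \d]', ' ', text)
    # Drop single lowercase letters standing alone (e.g. "a", or the "s" left over from "'s")
    text = re.sub(r'(\s[a-z]{1}\s)', ' ', text)
    # Collapse runs of whitespace, then lowercase
    text = re.sub(r'([\s]{2,})', ' ', text)
    text = text.lower()
    return text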

In [10]:
cleaned_text = [clean_text(t) for t in tweets_text]

In [11]:
cleaned_text[0]

'on this national gun violence awareness day let your voice be heard and show your commitment to reducing gun viole '
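
In `clean_text`, the `[\W]` substitution replaces `@` with a space before the mention pattern runs, so handles are kept (minus the `@`) rather than removed. If dropping mentions is the goal, one possible reordering is sketched below. This is an illustration under that assumption, not the function that produced the outputs in this notebook; the name `clean_text_strict` and the sample tweet are hypothetical.

```python
import re

def clean_text_strict(text):
    """Hypothetical variant of clean_text that removes @mentions as well as URLs."""
    text = re.sub(r'http\S+', ' ', text)    # URLs first, while ':' and '/' are still present
    text = re.sub(r'@\w+', ' ', text)       # mentions next, while '@' is still present
    text = re.sub(r'[\W\d]', ' ', text)     # then punctuation, symbols and digits
    text = text.lower()                     # lowercase before the single-letter filter
    text = re.sub(r'\b[a-z]\b', ' ', text)  # drop stray single letters such as the "s" from "'s"
    text = re.sub(r'\s{2,}', ' ', text)     # collapse runs of whitespace
    return text.strip()

# Made-up sample tweet, for illustration only
sample = "Check out @moderntradermag's preview: http://example.com/a1 It's 100% free"
result = clean_text_strict(sample)
print(result)  # check out preview it free
```

Raw strings (`r'...'`) keep escapes such as `\b` and `\S` intact for the regex engine, and lowercasing before the single-letter filter lets it catch a lone capital letter as well.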

In [12]:
tweets_df['cleaned_text'] = cleaned_text

In [13]:
# Truncate each timestamp to midnight (dtype stays datetime64[ns]) so tweets can be matched by date
tweets_df['Date'] = tweets_df['timestamp'].dt.normalize()

In [14]:
tweets_df.sample(5)

| | _id | text | timestamp | username | cleaned_text | Date |
|---|---|---|---|---|---|---|
| 17668 | 593deba457bbd40476646747 | Good Morning America and Fox News where i crea... | 2017-04-08 18:16:48 | jimcramer | good morning america and fox news where create... | 2017-04-08 |
| 83397 | 593dec6757bbd4047665681d | #Stocktoberfest sponsor @moderntradermag is gi... | 2016-09-15 16:59:37 | abnormalreturns | stocktoberfest sponsor moderntradermag is giv... | 2016-09-15 |
| 75318 | 593dec4d57bbd4047665488b | Check out our Punter's Premier League Preview ... | 2014-08-14 13:48:26 | ReutersInsider | check out our punter premier league preview ci... | 2014-08-14 |
| 27495 | 593debc157bbd40476648dad | Petting zoo left a chicken behind at the park.... | 2013-05-18 23:44:07 | elonmusk | petting zoo left chicken behind at the park no... | 2013-05-18 |
| 51569 | 593dec0857bbd4047664ebbf | Saving the euro from itself is a work in progr... | 2017-02-06 23:40:05 | MktsInsider | saving the euro from itself is work in progres... | 2017-02-06 |


In [15]:
tweets_df['Date'].dtype

dtype('<M8[ns]')

In [16]:
tweets_df.shape

(94856, 6)

In [17]:
sp500_df = joblib.load('../Analyzing_Unstructured_Data_for_Finance/data/3.sp500_df.pickle')

In [18]:
def match_dates_and_pull_y(features_df, target_df):
    # For each tweet date, look up that day's class label in target_df
    change = []

    for i in features_df['Date']:
        try:
            if i in list(target_df['Date']):
                change.append(target_df['Percent_Change_Class'].loc[target_df['Date']==i].values)
            else:
                # No row in target_df for this date; mark it with a placeholder
                change.append('x')
        except Exception as e:
            print('Error:', e)

    return pd.DataFrame(change)

In [20]:
start = datetime.now()

change_df = match_dates_and_pull_y(tweets_df, sp500_df)

end = datetime.now()
print(end - start)

0:13:38.632550
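
Much of the runtime above likely comes from rebuilding `list(target_df['Date'])` and scanning it once per tweet. A minimal sketch of a faster lookup follows, assuming each date appears at most once in the target table. The function name `match_labels_by_date` and the toy dates and labels are hypothetical, and unlike `match_dates_and_pull_y` it returns plain labels rather than one-element arrays.

```python
from datetime import date

def match_labels_by_date(feature_dates, target_dates, target_labels, missing='x'):
    """Return one label per feature date, or `missing` when the date has no target row."""
    # Build the date -> label lookup once; each later lookup is a dict access
    label_by_date = dict(zip(target_dates, target_labels))
    return [label_by_date.get(d, missing) for d in feature_dates]

# Toy data standing in for sp500_df['Date'], sp500_df['Percent_Change_Class'] and tweets_df['Date']
trading_days = [date(2017, 6, 1), date(2017, 6, 2)]
classes = ['up', 'down']
tweet_days = [date(2017, 6, 2), date(2017, 6, 3), date(2017, 6, 1), date(2017, 6, 2)]

labels = match_labels_by_date(tweet_days, trading_days, classes)
print(labels)  # ['down', 'x', 'up', 'down']
```

With the notebook's frames, the equivalent call would be `match_labels_by_date(tweets_df['Date'], sp500_df['Date'], sp500_df['Percent_Change_Class'])`, wrapped in `pd.DataFrame(...)` to get a single column named `0`.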


In [21]:
change_df[0].value_counts()

down    43367
up      33891
x       17514
n/a        84
Name: 0, dtype: int64

In [22]:
combined_df_nodrop = tweets_df.merge(change_df, left_index=True, right_index=True)

In [23]:
combined_df = combined_df_nodrop[combined_df_nodrop[0]!='x']

In [24]:
combined_df = combined_df[combined_df[0]!='n/a']
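
The two filters above can also be expressed as a single step with `Series.isin`. A small sketch on a toy frame follows; the data and the names `toy` and `labelled` are hypothetical, with column `0` standing in for the label column produced by the merge.

```python
import pandas as pd

# Toy frame; column 0 mimics the label column from the merge
toy = pd.DataFrame({'cleaned_text': ['a b', 'c d', 'e f', 'g h'],
                    0: ['up', 'x', 'n/a', 'down']})

# Keep only rows whose label is a real class, dropping both placeholders at once
labelled = toy[~toy[0].isin(['x', 'n/a'])]
print(labelled[0].tolist())  # ['up', 'down']
```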

In [25]:
joblib.dump(combined_df, '../Analyzing_Unstructured_Data_for_Finance/data_cleaned/b1.combined_df.pickle')

['../Analyzing_Unstructured_Data_for_Finance/data_cleaned/b1.combined_df.pickle']

**Now split the data into X and y (our features and target)**

In [26]:
X = combined_df.drop(0, axis=1)
y = combined_df[0]

In [27]:
print(X.shape)
print(y.shape)

(77258, 6)
(77258,)


In [28]:
joblib.dump(X, '../Analyzing_Unstructured_Data_for_Finance/data_cleaned/b1.X.pickle')

['../Analyzing_Unstructured_Data_for_Finance/data_cleaned/b1.X.pickle']

In [29]:
joblib.dump(y, '../Analyzing_Unstructured_Data_for_Finance/data_cleaned/b1.y.pickle')

['../Analyzing_Unstructured_Data_for_Finance/data_cleaned/b1.y.pickle']