# Part 2 -- Pre-Processing and Text Cleaning

Next we do some basic pre-processing: converting the text to lower case, stripping punctuation, and deleting URLs. We don't need to remove stopwords such as "the" and "a" here; that happens later, when we pass our data through an NLTK model.
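
To make the deferred step concrete, here is a minimal sketch of what stopword removal does to cleaned text. The tiny `STOPWORDS` set and the `remove_stopwords` helper are illustrative assumptions; they are not the NLTK list or the code used later in this series.

```python
# Hypothetical, hand-picked stopword set for illustration only;
# a real stopword list (such as NLTK's) is far longer.
STOPWORDS = {"the", "a", "an", "and", "to", "of"}

def remove_stopwords(text):
    """Drop stopwords from already lower-cased, whitespace-separated text."""
    return " ".join(word for word in text.split() if word not in STOPWORDS)

print(remove_stopwords("thank you to the team and a great crowd"))
```

Filtering on whole tokens like this only works once the text is lower-cased and free of punctuation, which is why the cleaning below comes first.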

**Load library code**

In [1]:
from os import chdir
chdir('/home/jovyan/work/Analyzing_Unstructured_Data_for_Finance/')

from lib import *
# suppress_warnings()

In [4]:
!pip install pymongo
import pymongo


In [5]:
# Connect on port 27016 instead of MongoDB's default 27017, as a small security measure
cli = pymongo.MongoClient(host='52.27.11.214', port=27016)

In [6]:
completed_collection = cli.twitter_db.completed_collection
cli.twitter_db.collection_names()

['completed_collection', 'task_collection']

**Clean data**

In [7]:
# Get data out of our MongoDB collection
tweets_list = [document for document in completed_collection.find()]
tweets_df = pd.DataFrame(tweets_list)
tweets_df.head(2)

| | _id | text | timestamp | username |
|---|---|---|---|---|
| 0 | 593deb7057bbd4047664223e | On this National Gun Violence Awareness Day, l... | 2017-06-02 17:35:54 | BarackObama |
| 1 | 593deb7057bbd4047664223f | Forever grateful for the service and sacrifice... | 2017-05-29 13:09:16 | BarackObama |


In [8]:
tweets_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94856 entries, 0 to 94855
Data columns (total 4 columns):
_id          94856 non-null object
text         94856 non-null object
timestamp    94856 non-null datetime64[ns]
username     94856 non-null object
dtypes: datetime64[ns](1), object(3)
memory usage: 2.9+ MB


In [9]:
# Pull the tweet text out as a plain Python list (instead of a Series)
# so we can apply functions to it
tweets_text = tweets_df['text'].tolist()

In [10]:
import re

In [11]:
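def clean_text(text):
    # Raw strings matter here: in a plain string '\b' is a backspace, not a word boundary
    # Remove URLs that start with http or https
    text = re.sub(r'(http\S+)', '', text)
    # Remove remaining bare domains such as www.example.com/path
    text = re.sub(r'(http(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)', '', text)
    # Replace every non-word character (punctuation, symbols) with a space
    text = re.sub(r'[\W]', ' ', text)
    # Intended to drop @mentions, but '@' is already gone after the previous
    # step, so this never matches and handles remain as plain words
    text = re.sub(r'(\@.*?(?=\s))', ' ', text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r'([\s]{2,})', ' ', text)
    text = text.lower()
    return text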
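
Because `[\W]` replaces `@` with a space before the mention pattern runs, the mention step in `clean_text` has nothing left to match, and handles survive as plain words. If dropping mentions entirely is preferred, one option is to strip them before punctuation. The sketch below is a hypothetical variant, not the function used in this notebook: `clean_text_drop_mentions` is an invented name, it uses a simpler `@\w+` pattern, omits the bare-domain step, and trims surrounding whitespace. Using it would change the outputs shown in the following cells. The example URL is made up.

```python
import re

def clean_text_drop_mentions(text):
    """Hypothetical variant of clean_text that removes @mentions entirely."""
    text = re.sub(r'http\S+', '', text)   # drop URLs
    text = re.sub(r'@\w+', ' ', text)     # drop @handles while '@' still exists
    text = re.sub(r'\W', ' ', text)       # punctuation and symbols -> space
    text = re.sub(r'\s{2,}', ' ', text)   # collapse whitespace runs
    return text.lower().strip()

print(clean_text_drop_mentions('RT @RyanBartholomee: Llama ask you https://t.co/abc'))
```

Whether to keep handles is a modeling choice: keeping them preserves who is being addressed, while dropping them keeps usernames out of the vocabulary.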

In [12]:
cleaned_text = [clean_text(t) for t in tweets_text]

In [13]:
cleaned_text[0]

'on this national gun violence awareness day let your voice be heard and show your commitment to reducing gun viole '

In [14]:
tweets_df['cleaned_text'] = cleaned_text

In [15]:
# Truncate each timestamp to midnight, keeping the datetime64 dtype
tweets_df['Date'] = tweets_df['timestamp'].dt.normalize()
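
Truncating timestamps to a day can be done either by round-tripping through `dt.date` or with `dt.normalize()`. The small sketch below, which uses two timestamps from the sample rows above and illustrative variable names, suggests that both routes produce the same midnight values.

```python
import pandas as pd

# Two timestamps taken from the sample rows shown earlier
ts = pd.Series(pd.to_datetime(['2017-06-02 17:35:54', '2017-05-29 13:09:16']))

via_date = pd.to_datetime(ts.dt.date)   # datetime -> date objects -> datetime64
via_normalize = ts.dt.normalize()       # set the time component to midnight

print(via_normalize)
print((via_date == via_normalize).all())
```

Keeping the column as a datetime64 type (rather than Python `date` objects) makes later joins against a date-indexed price series more straightforward.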

In [16]:
tweets_df.sample(5)

| | _id | text | timestamp | username | cleaned_text | Date |
|---|---|---|---|---|---|---|
| 26082 | 593debbe57bbd40476648828 | Landing https://t.co/dRWGyyTtCH | 2016-07-18 20:42:23 | elonmusk | landing | 2016-07-18 |
| 30414 | 593debc957bbd40476649915 | .@BetteMidler brings her best to “Hello, Dolly... | 2017-04-27 12:00:24 | NewYorker | bettemidler brings her best to hello dolly | 2017-04-27 |
| 48244 | 593debfe57bbd4047664dec1 | RT @RyanBartholomee: Llama ask you if you are ... | 2017-04-15 01:07:18 | sacca | rt ryanbartholomee llama ask you if you are wa... | 2017-04-15 |
| 18124 | 593deba657bbd4047664690f | thank you! https://t.co/mdq19gHBOm | 2017-03-29 13:49:54 | jimcramer | thank you | 2017-03-29 |
| 45703 | 593debf557bbd4047664d4d3 | This new Tinder alternative is using artificia... | 2017-05-05 04:00:01 | Forbes | this new tinder alternative is using artificia... | 2017-05-05 |


In [17]:
tweets_df['Date'].dtype

dtype('<M8[ns]')

In [18]:
tweets_df.shape

(94856, 6)

In [19]:
joblib.dump(tweets_df, '../Analyzing_Unstructured_Data_for_Finance/data/2.tweets_df.pickle')

['../Analyzing_Unstructured_Data_for_Finance/data/2.tweets_df.pickle']

In [20]:
# Use the dates to determine the date range we want to pull stock data for
print(min(tweets_df['Date']))
print(max(tweets_df['Date']))

2008-09-10 00:00:00
2017-06-12 00:00:00
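
One way to turn these bounds into a calendar for the stock data pull is sketched below. It assumes plain weekdays are an acceptable stand-in for trading days (market holidays are not excluded), and `start`, `end`, and `trading_days` are illustrative names rather than variables from this notebook.

```python
import pandas as pd

# Bounds copied from the output above
start = pd.Timestamp('2008-09-10')
end = pd.Timestamp('2017-06-12')

# Monday-to-Friday dates between the bounds, inclusive
trading_days = pd.bdate_range(start=start, end=end)

print(trading_days[0].date(), trading_days[-1].date(), len(trading_days))
```

Tweets posted on weekends fall outside this index, so they would need to be mapped to a neighboring weekday before being aligned with prices.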
