# Part 2 -- Pre-Processing and Text Cleaning

Next we do some basic pre-processing: converting the text to lower case and deleting URLs. We don't need to remove stopwords such as "the" and "a" here; that happens later, when we pass the data through an NLTK model.
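For readers new to the term, stopword removal simply means dropping very common words before analysis. Here is a minimal sketch of the idea, assuming a tiny hand-picked stopword set. The `STOPWORDS` set and the `drop_stopwords` helper are hypothetical names used only for illustration; the NLTK step mentioned above would rely on a fuller list.

```python
# Tiny hand-picked stopword set, for illustration only (an assumption,
# not the list the later NLTK step uses).
STOPWORDS = {"the", "a", "an", "and", "to", "of"}

def drop_stopwords(text):
    """Split already-lowercased text on whitespace and drop stopwords."""
    return [token for token in text.split() if token not in STOPWORDS]

tokens = drop_stopwords("the market fell and a rally followed")
print(tokens)  # ['market', 'fell', 'rally', 'followed']
```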

### Load library code:

In [1]:
from os import chdir
chdir('/home/jovyan/work/Portfolio/Analyzing_Unstructured_Data_for_Finance/')

from lib import *
# suppress_warnings()

In [2]:
!pip install pymongo
import pymongo

You are using pip version 8.1.2, however version 9.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [3]:
# Connect on a non-default port (MongoDB's default is 27017) as a small security measure
cli = pymongo.MongoClient(host='52.27.11.214', port=27016)

In [4]:
completed_collection = cli.twitter_db.completed_collection

# Note: collection_names() was removed in PyMongo 4.0; use list_collection_names() there
cli.twitter_db.collection_names()

['completed_collection', 'task_collection']

**Clean data**

In [5]:
# Get data out of our MongoDB collection
tweets_list = list(completed_collection.find())
tweets_df = pd.DataFrame(tweets_list)
tweets_df.head()

Unnamed: 0,_id,text,timestamp,username
0,593deb7057bbd4047664223e,"On this National Gun Violence Awareness Day, l...",2017-06-02 17:35:54,BarackObama
1,593deb7057bbd4047664223f,Forever grateful for the service and sacrifice...,2017-05-29 13:09:16,BarackObama
2,593deb7057bbd40476642240,Good to see my friend Prince Harry in London t...,2017-05-27 13:15:25,BarackObama
3,593deb7057bbd40476642241,"Through faith, love, and resolve the character...",2017-05-25 14:13:35,BarackObama
4,593deb7057bbd40476642242,Our hearts go out to those killed and wounded ...,2017-05-23 16:56:14,BarackObama


In [6]:
tweets_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94856 entries, 0 to 94855
Data columns (total 4 columns):
_id          94856 non-null object
text         94856 non-null object
timestamp    94856 non-null datetime64[ns]
username     94856 non-null object
dtypes: datetime64[ns](1), object(3)
memory usage: 2.9+ MB


In [7]:
tweets_df['Date'] = tweets_df['timestamp'].dt.date

In [8]:
tweets_df.sample(5)

Unnamed: 0,_id,text,timestamp,username,Date
88615,593dec7757bbd40476657c81,Asian stock markets mixed as investors sweat U...,2017-06-09 02:23:44,FinancialTimes,2017-06-09
65056,593dec2e57bbd40476652072,Apple is trading in a way that has ONLY happen...,2016-09-15 15:39:15,StockTwits,2016-09-15
87234,593dec7257bbd4047665771b,RT @KristenScholer: Those betting that the cur...,2016-03-07 13:53:56,WSJMoneyBeat,2016-03-07
8839,593deb8857bbd404766444c7,Many people are asking how hey can help in Hai...,2010-01-20 18:59:40,BillGates,2010-01-20
15402,593deb9b57bbd40476645e6d,Two minutes to me,2017-06-08 12:51:06,jimcramer,2017-06-08


In [9]:
# Pull the tweet text into a plain Python list so we can loop over it
tweets_text = tweets_df['text'].tolist()

In [10]:
import re

In [11]:
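def clean_text(text):
    # Raw strings keep \b as a regex word boundary (in a plain string literal it is a backspace character)
    # Remove URLs: first anything starting with "http", then bare domains such as "www.example.com/path"
    text = re.sub(r'(http\S+)', '', text)
    text = re.sub(r'(http(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)', '', text)
    # Remove @mentions before stripping punctuation; once '@' has become a space this pattern can no longer match
    text = re.sub(r'(\@.*?(?=\s))', ' ', text)
    # Replace every remaining non-word character with a space
    text = re.sub(r'[\W]', ' ', text)
    # Collapse runs of whitespace, then lower-case the result
    text = re.sub(r'([\s]{2,})', ' ', text)
    text = text.lower()
    return text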

In [12]:
cleaned_text = [clean_text(t) for t in tweets_text]

In [13]:
cleaned_text[0]

'on this national gun violence awareness day let your voice be heard and show your commitment to reducing gun viole '
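To see what each substitution contributes, it helps to trace the steps on a single string. The sketch below is a simplified stand-in, not the notebook's `clean_text`: the function name `trace_cleaning`, the `mentions_first` flag, and the sample tweet are all made up for illustration, and it uses one URL pattern instead of two. It also shows why the order of the steps matters: if non-word characters are replaced first, the `@` is already gone by the time the mention pattern runs, so the username stays in the text.

```python
import re

def trace_cleaning(text, mentions_first=True):
    """Clean a tweet and return (step name, text after that step) pairs.

    Hypothetical helper for illustration; not part of the notebook's lib.
    """
    strip_mentions = ('mentions', lambda s: re.sub(r'@\S+', ' ', s))
    strip_nonword = ('non-word', lambda s: re.sub(r'\W', ' ', s))
    middle = [strip_mentions, strip_nonword]
    if not mentions_first:
        middle.reverse()

    steps = [('urls', lambda s: re.sub(r'http\S+', '', s))]
    steps += middle
    steps += [
        ('whitespace', lambda s: re.sub(r'\s{2,}', ' ', s)),
        ('lowercase', str.lower),
    ]

    trace = []
    for name, step in steps:
        text = step(text)
        trace.append((name, text))
    return trace

# Invented sample tweet containing a mention, a URL, and a hashtag
sample = 'RT @SomeUser: Stocks rally!  Read more at https://example.com/a?b=1 #markets'

for name, text in trace_cleaning(sample):
    print('{:<10} {!r}'.format(name, text))

good = trace_cleaning(sample, mentions_first=True)[-1][1]
bad = trace_cleaning(sample, mentions_first=False)[-1][1]
print(good)  # rt stocks rally read more at markets
print(bad)   # rt someuser stocks rally read more at markets
```

With mentions stripped first, the username disappears; with the order reversed, `someuser` survives as an ordinary word.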

In [14]:
tweets_df['cleaned_text'] = cleaned_text

In [15]:
tweets_df.sample(5)

Unnamed: 0,_id,text,timestamp,username,Date,cleaned_text
58756,593dec1c57bbd404766507d4,"Alas, you can never* predict exactly what the ...",2015-04-09 18:07:00,themotleyfool,2015-04-09,alas you can never predict exactly what the ma...
74803,593dec4c57bbd40476654688,Wacky day on Wall Street. Nasdaq was down more...,2014-10-15 19:56:55,ReutersInsider,2014-10-15,wacky day on wall street nasdaq was down more ...
5600,593deb7f57bbd4047664381f,Trump says new travel ban executive order will...,2017-02-16 19:38:41,cnnbrk,2017-02-16,trump says new travel ban executive order will...
4603,593deb7c57bbd4047664343a,Secret Service agent on VP Pence's detail susp...,2017-04-05 23:21:54,cnnbrk,2017-04-05,secret service agent on vp pence s detail susp...
45442,593debf357bbd4047664d3ce,No more Florida: see what Earth would look lik...,2017-05-09 15:00:07,Forbes,2017-05-09,no more florida see what earth would look like...


In [16]:
tweets_df.shape

(94856, 6)

In [17]:
pd.to_pickle(tweets_df, '../Analyzing_Unstructured_Data_for_Finance/data/2.tweets_df.pickle')

In [18]:
# Use the oldest date to determine how far back we want to pull stock data from
min(tweets_df['Date'])

datetime.date(2008, 9, 10)
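
The earliest tweet date marks the start of the window for which stock prices are needed. Here is a minimal sketch of turning a set of tweet dates into such a window, using only the standard library. The `tweet_dates` list holds a few dates copied from the outputs above, and `price_window` is a hypothetical helper, not something defined in `lib`.

```python
from datetime import date

def price_window(dates):
    """Return the (earliest, latest) dates to cover when pulling prices."""
    return min(dates), max(dates)

# A handful of dates taken from the sample rows shown earlier
tweet_dates = [
    date(2017, 6, 9),
    date(2016, 9, 15),
    date(2010, 1, 20),
    date(2008, 9, 10),
]

start, end = price_window(tweet_dates)
print(start.isoformat(), end.isoformat())  # 2008-09-10 2017-06-09
```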