# Part 2 -- Pre-Processing and Text Cleaning

Next, we do some basic pre-processing, such as converting the text to lower case and deleting URLs. We don't need to remove stopwords such as "the" and "a" here; that happens later, when we pass the data through an NLTK model.

### Load lib code:

In [1]:
from os import chdir
chdir('/home/jovyan/work/Portfolio/Analyzing_Unstructured_Data_for_Finance/')

from lib import *
# suppress_warnings()

In [2]:
!pip install pymongo
import pymongo



In [3]:
# Connect on a non-default port (MongoDB's default is 27017) as a basic security measure
cli = pymongo.MongoClient(host='52.27.11.214', port=27016)

In [4]:
completed_collection = cli.twitter_db.completed_collection
cli.twitter_db.collection_names()

['completed_collection', 'task_collection']

**Clean data**

In [7]:
# Get data out of our MongoDB collection
tweets_list = [document for document in completed_collection.find()]
tweets_df = pd.DataFrame(tweets_list)
tweets_df.head(2)

| | _id | text | timestamp | username |
|---|---|---|---|---|
| 0 | 593deb7057bbd4047664223e | On this National Gun Violence Awareness Day, l... | 2017-06-02 17:35:54 | BarackObama |
| 1 | 593deb7057bbd4047664223f | Forever grateful for the service and sacrifice... | 2017-05-29 13:09:16 | BarackObama |


In [8]:
tweets_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94856 entries, 0 to 94855
Data columns (total 4 columns):
_id          94856 non-null object
text         94856 non-null object
timestamp    94856 non-null datetime64[ns]
username     94856 non-null object
dtypes: datetime64[ns](1), object(3)
memory usage: 2.9+ MB


In [10]:
# Make our text into a list (instead of a Series) so we can apply functions to it
tweets_text = tweets_df['text'].tolist()

In [11]:
import re

In [12]:
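def clean_text(text):
    # Raw strings keep \b as a regex word boundary (in a plain string it is a backspace)
    # Remove links that start with http/https
    text = re.sub(r'(http\S+)', '', text)
    # Remove remaining URL-like tokens such as www.example.com
    text = re.sub(r'(http(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)', '', text)
    # Replace every non-word character (punctuation, symbols, '@') with a space
    text = re.sub(r'[\W]', ' ', text)
    # '@' is already gone at this point, so this mention pattern matches nothing
    # and handles stay in the text (e.g. 'rt mcfaul why')
    text = re.sub(r'(\@.*?(?=\s))', ' ', text)
    # Collapse runs of whitespace into a single space, then lower-case
    text = re.sub(r'([\s]{2,})', ' ', text)
    text = text.lower()
    return text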

In [13]:
cleaned_text = [clean_text(t) for t in tweets_text]

In [14]:
cleaned_text[0]

'on this national gun violence awareness day let your voice be heard and show your commitment to reducing gun viole '
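Note that `clean_text` replaces every non-word character, including `@`, before its mention pattern runs, so Twitter handles survive cleaning as plain words. If the goal were to drop mentions entirely, one option is to strip them while the `@` is still present. The following is only a minimal sketch under that assumption: `clean_text_drop_mentions` is a hypothetical name, its patterns are simplified, and it is not the function used in the rest of this notebook.

```python
import re

def clean_text_drop_mentions(text):
    """Hypothetical variant of clean_text that removes @mentions first."""
    text = re.sub(r'http\S+', '', text)   # drop URLs
    text = re.sub(r'@\w+', ' ', text)     # drop @handles while '@' is still present
    text = re.sub(r'\W', ' ', text)       # punctuation and symbols -> spaces
    text = re.sub(r'\s{2,}', ' ', text)   # collapse repeated whitespace
    return text.lower().strip()

sample = 'RT @McFaul: Why? https://t.co/2A1WaY1X0x'
print(clean_text_drop_mentions(sample))   # prints: rt why
```

Unlike `clean_text`, this sketch also strips leading and trailing whitespace. Whether handles should be kept depends on the later analysis, since the mentioned account can itself carry signal.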

In [15]:
tweets_df['cleaned_text'] = cleaned_text

In [18]:
# Drop the time of day but keep a datetime64 column
tweets_df['Date'] = tweets_df['timestamp'].dt.normalize()

In [19]:
tweets_df.sample(5)

| | _id | text | timestamp | username | cleaned_text | Date |
|---|---|---|---|---|---|---|
| 65961 | 593dec3157bbd404766523fc | #Qatar Airways has ~23 flights to the UAE each... | 2017-06-07 16:42:33 | steve_hanke | qatar airways has 23 flights to the uae each ... | 2017-06-07 |
| 80626 | 593dec5f57bbd40476655d49 | RT @DonDraperClone: Inside Obama’s secret outr... | 2017-05-29 13:17:41 | zerohedge | rt dondraperclone inside obama s secret outrea... | 2017-05-29 |
| 48141 | 593debfd57bbd4047664de5a | RT @McFaul: Why? https://t.co/2A1WaY1X0x | 2017-04-18 02:05:53 | sacca | rt mcfaul why | 2017-04-18 |
| 4458 | 593deb7c57bbd404766433a9 | RT @CNNMoney: Dollar dives 0.6% after Presiden... | 2017-04-12 20:07:16 | cnnbrk | rt cnnmoney dollar dives 0 6 after president t... | 2017-04-12 |
| 53771 | 593dec0e57bbd4047664f45a | Beaten-down healthcare stocks are among top di... | 2017-04-01 13:00:04 | ForbesInvestor | beaten down healthcare stocks are among top di... | 2017-04-01 |


In [20]:
tweets_df['Date'].dtype

dtype('<M8[ns]')

In [21]:
tweets_df.shape

(94856, 6)

In [22]:
joblib.dump(tweets_df, '../Analyzing_Unstructured_Data_for_Finance/data/2.tweets_df.pickle')

['../Analyzing_Unstructured_Data_for_Finance/data/2.tweets_df.pickle']

In [23]:
# Use the dates to determine the date range we want to pull stock data for
print(min(tweets_df['Date']))
print(max(tweets_df['Date']))

2008-09-10 00:00:00
2017-06-12 00:00:00
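These two dates can be turned into the bounds for a stock-data request. The sketch below is an illustration only: it uses a three-row stand-in for `tweets_df['Date']`, and the names `dates`, `start_date`, `end_date`, and `trading_days` are assumptions that do not come from this notebook. `pd.bdate_range` skips weekends but knows nothing about market holidays, so it only approximates the set of trading days.

```python
import pandas as pd

# Toy stand-in for tweets_df['Date'] (assumed values, including the min and max printed above)
dates = pd.Series(pd.to_datetime(['2008-09-10', '2012-01-15', '2017-06-12']))

# Series.min()/max() return Timestamps; format them as ISO date strings
start_date = dates.min().strftime('%Y-%m-%d')
end_date = dates.max().strftime('%Y-%m-%d')

# Weekdays between the two bounds (weekends excluded, market holidays NOT excluded)
trading_days = pd.bdate_range(start=start_date, end=end_date)

print(start_date, end_date)
```

String bounds like these are a common input format for price-history APIs, but the exact parameters depend on whichever data source is used to fetch the stock data.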
