# Part 4 -- Combine Features and Target

Our feature data (tweets) and target data (stock prices) come from different sources and have different numbers of rows and columns, so we need to align them before modeling. Every row of X must have exactly one matching y, because the models we feed this data into expect X and y to have the same number of rows.

**Load library code**

In [1]:
from os import chdir
chdir('/home/jovyan/work/Analyzing_Unstructured_Data_for_Finance/')

from lib import *
# suppress_warnings()

In [2]:
sp500_df = joblib.load('../Analyzing_Unstructured_Data_for_Finance/data/3.sp500_df.pickle')

In [3]:
tweets_df = joblib.load('../Analyzing_Unstructured_Data_for_Finance/data/2.tweets_df.pickle')

In [4]:
sp500_df.shape

(2013, 10)

In [5]:
tweets_df.shape

(94856, 6)

**Do some checks to see if everything matches correctly before merging the dataframes**

In [6]:
sp500_df[2000:2002]

| | Date | Open | High | Low | Close | Adj Close | Volume | Diff | Percent_Change | Percent_Change_Class |
|---|---|---|---|---|---|---|---|---|---|---|
| 2000 | 2017-05-24 | 2401.409912 | 2405.580078 | 2397.98999 | 2404.389893 | 2404.389893 | 3389900000 | 5.969971 | 0.002483 | up |
| 2001 | 2017-05-25 | 2409.540039 | 2418.709961 | 2408.01001 | 2415.070068 | 2415.070068 | 3535390000 | 10.680176 | 0.004422 | up |


In [8]:
tweets_df.head(4)

| | _id | text | timestamp | username | cleaned_text | Date |
|---|---|---|---|---|---|---|
| 0 | 593deb7057bbd4047664223e | On this National Gun Violence Awareness Day, l... | 2017-06-02 17:35:54 | BarackObama | on this national gun violence awareness day le... | 2017-06-02 |
| 1 | 593deb7057bbd4047664223f | Forever grateful for the service and sacrifice... | 2017-05-29 13:09:16 | BarackObama | forever grateful for the service and sacrifice... | 2017-05-29 |
| 2 | 593deb7057bbd40476642240 | Good to see my friend Prince Harry in London t... | 2017-05-27 13:15:25 | BarackObama | good to see my friend prince harry in london t... | 2017-05-27 |
| 3 | 593deb7057bbd40476642241 | Through faith, love, and resolve the character... | 2017-05-25 14:13:35 | BarackObama | through faith love and resolve the character o... | 2017-05-25 |


In [9]:
tweets_df['Date'][3]==sp500_df['Date'][2001]

True
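The check above compares a single pair of rows. As a rough sketch of how the same idea extends to every tweet at once, `Series.isin` can report how many tweet dates also appear in the stock data. The two small frames below are hypothetical stand-ins for `tweets_df` and `sp500_df`, and the sketch assumes both `Date` columns hold values of the same type.

```python
import pandas as pd

# Hypothetical stand-ins for tweets_df and sp500_df (same Date type in both)
tweets = pd.DataFrame({'Date': ['2017-06-02', '2017-05-29', '2017-05-27', '2017-05-25']})
stocks = pd.DataFrame({'Date': ['2017-05-24', '2017-05-25', '2017-06-02']})

# Boolean mask: True where the tweet's date also appears in the stock data
has_stock_data = tweets['Date'].isin(stocks['Date'])

n_matched = int(has_stock_data.sum())   # number of tweets with a matching date
share_matched = has_stock_data.mean()   # fraction of tweets with a matching date
print(n_matched, share_matched)
```

Running the same two lines on the real frames would give an early estimate of how many tweets will survive the merge.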

In [10]:
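def match_dates_and_pull_y(features_df, target_df):
    """Return the Percent_Change_Class for each row's Date, or 'x' if the date has no match."""
    change = []

    for i in features_df['Date']:
        try:
            # Note: the list of dates is rebuilt for every tweet, which makes this loop slow
            if i in list(target_df['Date']):
                change.append(target_df['Percent_Change_Class'].loc[target_df['Date']==i].values)
            else:
                # No stock data for this date (e.g. a weekend or market holiday)
                change.append('x')
        except Exception as e:
            print('Error:', e)

    return pd.DataFrame(change)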

In [11]:
start = datetime.now()

change_df = match_dates_and_pull_y(tweets_df, sp500_df)

end = datetime.now()
print(end - start)

0:13:57.921029
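The loop takes about 14 minutes largely because it rebuilds the list of stock dates and rescans `target_df` for every tweet. A possible faster variant, sketched below, builds a Date-to-class lookup once and applies it with `Series.map`. The function name `match_dates_and_pull_y_fast` and the toy frames are hypothetical, and the sketch assumes each `Date` appears at most once in `target_df` (a duplicated date would make the lookup ambiguous).

```python
import pandas as pd

def match_dates_and_pull_y_fast(features_df, target_df, missing='x'):
    """Look up Percent_Change_Class for each feature row's Date.

    Assumes each Date appears at most once in target_df.
    """
    # Build the Date -> class lookup once instead of once per tweet
    lookup = target_df.set_index('Date')['Percent_Change_Class']
    # Dates absent from the lookup come back as NaN; replace them with the placeholder
    labels = features_df['Date'].map(lookup).fillna(missing)
    # Name the column 0 and keep the features index, matching the loop version's output
    return labels.to_frame(name=0)

# Hypothetical toy data standing in for tweets_df and sp500_df
tweets = pd.DataFrame({'Date': ['2017-06-02', '2017-05-29', '2017-05-25']})
stocks = pd.DataFrame({
    'Date': ['2017-05-25', '2017-06-02'],
    'Percent_Change_Class': ['up', 'up'],
})

change_df_fast = match_dates_and_pull_y_fast(tweets, stocks)
print(change_df_fast[0].tolist())
```

One difference to keep in mind: if `Percent_Change_Class` contained genuine missing values rather than the string 'n/a', this version would label them with the placeholder too.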


In [12]:
change_df[0].value_counts()

down    43367
up      33891
x       17514
n/a        84
Name: 0, dtype: int64

In [13]:
43367+33891+17514+84

94856

**Now we have our X's and y's**<br>
Our <u>Features</u> == Tweets (\_id, text, timestamp, username, cleaned_text, Date)<br>
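Our <u>Target</u> == direction of the change in stock prices (up/down); 'x' marks tweets whose date has no stock data, and 'n/a' is carried over from `Percent_Change_Class`<br>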

`change_df` has one row per tweet, in the same order as `tweets_df`, so the two can be joined on their index. All that remains is to drop the tweets whose 'Date' has no usable stock data (the rows labelled 'x' or 'n/a').

**Merge & drop rows where there's no stock data for Tweets**

In [14]:
combined_df_nodrop = tweets_df.merge(change_df, left_index=True, right_index=True)

In [15]:
combined_df_nodrop.head(2)

| | _id | text | timestamp | username | cleaned_text | Date | 0 |
|---|---|---|---|---|---|---|---|
| 0 | 593deb7057bbd4047664223e | On this National Gun Violence Awareness Day, l... | 2017-06-02 17:35:54 | BarackObama | on this national gun violence awareness day le... | 2017-06-02 | up |
| 1 | 593deb7057bbd4047664223f | Forever grateful for the service and sacrifice... | 2017-05-29 13:09:16 | BarackObama | forever grateful for the service and sacrifice... | 2017-05-29 | x |


In [16]:
combined_df = combined_df_nodrop[combined_df_nodrop[0]!='x']

In [17]:
combined_df = combined_df[combined_df[0]!='n/a']
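The two filters above could also be written as a single step with `Series.isin`. This is only a sketch on a hypothetical toy frame (`toy` stands in for `combined_df_nodrop`, with the label in column `0`); the result should be the same as applying the two `!=` filters in sequence.

```python
import pandas as pd

# Hypothetical stand-in for combined_df_nodrop; column 0 holds the label
toy = pd.DataFrame({
    'Date': ['2017-06-02', '2017-05-29', '2017-05-25', '2017-05-24'],
    0: ['up', 'x', 'n/a', 'down'],
})

# Labels that do not correspond to a real up/down class
placeholders = ['x', 'n/a']

# Keep only the rows whose label is not one of the placeholders
kept = toy[~toy[0].isin(placeholders)]
print(kept)
```

As with the original filters, the surviving rows keep their original index values, which is why the index is no longer consecutive after the drop.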

In [18]:
combined_df.head(2)

| | _id | text | timestamp | username | cleaned_text | Date | 0 |
|---|---|---|---|---|---|---|---|
| 0 | 593deb7057bbd4047664223e | On this National Gun Violence Awareness Day, l... | 2017-06-02 17:35:54 | BarackObama | on this national gun violence awareness day le... | 2017-06-02 | up |
| 3 | 593deb7057bbd40476642241 | Through faith, love, and resolve the character... | 2017-05-25 14:13:35 | BarackObama | through faith love and resolve the character o... | 2017-05-25 | up |


In [19]:
joblib.dump(combined_df, '../Analyzing_Unstructured_Data_for_Finance/data/4.combined_df.pickle')

['../Analyzing_Unstructured_Data_for_Finance/data/4.combined_df.pickle']

**Now split the data into X's and y's (our features and target)**

In [20]:
X = combined_df.drop(0, axis=1)
y = combined_df[0]

In [21]:
print(X.shape)
print(y.shape)

(77258, 6)
(77258,)


In [22]:
77258.0/94856.0
# We kept 81% of original tweets (after dropping ones that don't match)

0.814476680441933
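As a quick sanity check on that figure, the number of dropped rows should equal the 'x' and 'n/a' counts reported earlier. The sketch below uses the row counts from the outputs above; the variable names are just for illustration.

```python
n_original = 94856   # rows in tweets_df
n_kept = 77258       # rows in combined_df after dropping 'x' and 'n/a'

n_dropped = n_original - n_kept
retained = n_kept / n_original

# The dropped rows should be exactly the 'x' (17514) and 'n/a' (84) labels
print(n_dropped == 17514 + 84)
print(f"kept {retained:.1%} of the original tweets")
```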

In [23]:
joblib.dump(X, '../Analyzing_Unstructured_Data_for_Finance/data/4.X.pickle')

['../Analyzing_Unstructured_Data_for_Finance/data/4.X.pickle']

In [24]:
joblib.dump(y, '../Analyzing_Unstructured_Data_for_Finance/data/4.y.pickle')

['../Analyzing_Unstructured_Data_for_Finance/data/4.y.pickle']

In [25]:
combined_df[0].value_counts()

down    43367
up      33891
Name: 0, dtype: int64