# Part 4 -- Combine Features and Target (SP500)

Our feature data (Tweets) and target data (stocks) come from different sources and have different numbers of rows and columns, so we need to align them until our X's and y's share the same number of rows, in the same order. Models expect exactly one target value per feature row.

**Load library code**

In [1]:
from os import chdir
chdir('/home/jovyan/work/Portfolio/Analyzing_Unstructured_Data_for_Finance/')

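# lib is assumed to export pd, dt (the datetime module), and datetime (the class), which later cells use
from lib import *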
# suppress_warnings()

In [5]:
SP500_df = pd.read_pickle('../Analyzing_Unstructured_Data_for_Finance/data/3.2.sp500_stocks_df.pickle')

In [3]:
tweets_df = pd.read_pickle('../Analyzing_Unstructured_Data_for_Finance/data/2.tweets_df.pickle')

In [6]:
SP500_df.shape

(2014, 9)

In [6]:
tweets_df.shape

(94856, 6)

**Check that the dates in both dataframes match before merging**

In [12]:
SP500_df[2000:2005]

Unnamed: 0,Date,Close,Diff,High,Low,Open,Percent_Change,Volume,Percent_Change_Class
2000,2017-05-22,97.444651,0.586108,97.975369,96.590958,97.107944,0.005277,3692048.0,neutral
2001,2017-05-23,97.311697,-0.132954,98.053313,96.650339,97.431218,-0.000309,3514729.0,down
2002,2017-05-24,97.643273,0.331577,98.136727,96.762176,97.410639,0.001673,3571004.0,down
2003,2017-05-25,98.26,0.616727,98.941357,97.348224,97.950379,0.003405,4014203.0,down
2004,2017-05-26,98.282655,0.022655,98.837505,97.663014,98.257665,-0.000145,3026608.0,down


In [13]:
tweets_df.head(5)

Unnamed: 0,_id,text,timestamp,username,Date,cleaned_text
0,593deb7057bbd4047664223e,"On this National Gun Violence Awareness Day, l...",2017-06-02 17:35:54,BarackObama,2017-06-02,on this national gun violence awareness day le...
1,593deb7057bbd4047664223f,Forever grateful for the service and sacrifice...,2017-05-29 13:09:16,BarackObama,2017-05-29,forever grateful for the service and sacrifice...
2,593deb7057bbd40476642240,Good to see my friend Prince Harry in London t...,2017-05-27 13:15:25,BarackObama,2017-05-27,good to see my friend prince harry in london t...
3,593deb7057bbd40476642241,"Through faith, love, and resolve the character...",2017-05-25 14:13:35,BarackObama,2017-05-25,through faith love and resolve the character o...
4,593deb7057bbd40476642242,Our hearts go out to those killed and wounded ...,2017-05-23 16:56:14,BarackObama,2017-05-23,our hearts go out to those killed and wounded ...


In [16]:
tweets_df['Date'][3]==SP500_df['Date'][2003]

True

In [17]:
SP500_df[SP500_df['Percent_Change_Class']=='neutral']

Unnamed: 0,Date,Close,Diff,High,Low,Open,Percent_Change,Volume,Percent_Change_Class
6,2009-06-19,31.668702,0.119021,32.167198,31.296651,31.850296,0.005157,1.007858e+07,neutral
19,2009-07-09,30.185718,0.163667,30.650137,29.808269,30.260979,0.004954,7.289074e+06,neutral
52,2009-08-25,36.120046,0.132574,36.640911,35.764966,36.155285,0.005104,6.799715e+06,neutral
121,2009-12-02,39.372489,0.179186,39.722330,38.921244,39.190090,0.005126,6.119327e+06,neutral
136,2009-12-23,40.379887,0.229072,40.641244,39.932104,40.237715,0.004789,4.470901e+06,neutral
137,2009-12-24,40.595701,0.215814,40.762262,40.271425,40.458054,0.005443,1.875179e+06,neutral
173,2010-02-18,39.882190,0.272573,40.071964,39.306501,39.544537,0.005485,6.235484e+06,neutral
187,2010-03-10,41.514382,0.209798,41.801213,41.037775,41.326382,0.005110,7.337071e+06,neutral
221,2010-04-28,43.705865,0.216067,44.238112,43.149101,43.756787,0.004860,8.829739e+06,neutral
304,2010-08-25,39.492220,0.205830,39.699417,38.685045,39.024552,0.005150,7.013509e+06,neutral


In [18]:
SP500_df['Percent_Change_Class'].loc[SP500_df['Date']==dt.date(2009, 6, 19)].values

array(['neutral'], dtype=object)
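The spot checks above compare single values. A broader check can confirm that both `Date` columns hold the same Python type and that each trading day appears only once in the target. The following is a minimal sketch on toy data; the helper name `check_date_columns` and the small frames are assumptions for illustration and are not part of the original notebook.

```python
import datetime as dt
import pandas as pd

def check_date_columns(features_df, target_df, col='Date'):
    """Return basic compatibility facts about the two Date columns."""
    feature_types = set(type(d) for d in features_df[col])
    target_types = set(type(d) for d in target_df[col])
    return {
        # Equality tests such as == silently fail across mismatched types
        'same_types': feature_types == target_types,
        # A duplicated trading day would make the lookup ambiguous
        'target_dates_unique': target_df[col].is_unique,
        # Feature rows whose Date never appears in the target
        'n_feature_dates_missing': int((~features_df[col].isin(target_df[col])).sum()),
    }

# Toy stand-ins for tweets_df and SP500_df (hypothetical values)
toy_tweets = pd.DataFrame({'Date': [dt.date(2017, 5, 25), dt.date(2017, 5, 27)]})
toy_stocks = pd.DataFrame({
    'Date': [dt.date(2017, 5, 25), dt.date(2017, 5, 26)],
    'Percent_Change_Class': ['down', 'down'],
})

report = check_date_columns(toy_tweets, toy_stocks)
print(report)
```

On the real data, the call would presumably be `check_date_columns(tweets_df, SP500_df)`.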

In [19]:
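def match_dates_and_get_stock_change(features_df, target_df):
    """Return one label per row of features_df: the Percent_Change_Class for that
    row's Date, or 'x' when target_df has no row for that Date."""
    change = []

    for i in features_df['Date']:
        try:
            if i in list(target_df['Date']):
                # .values is a length-1 array here; [0] stores the label itself
                change.append(target_df['Percent_Change_Class'].loc[target_df['Date']==i].values[0])
            else:
                # No stock data for this date (e.g. a weekend or market holiday)
                change.append('x')
        except Exception as e:
            print('Error:', e)
            
    return pd.DataFrame(change)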

In [21]:
start = datetime.now()

change_df = match_dates_and_get_stock_change(tweets_df, SP500_df)

end = datetime.now()
print(end - start)

0:01:04.010730
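Most of that minute likely goes to rebuilding `list(target_df['Date'])` and rescanning the target for every Tweet. One possible alternative, sketched here on toy data, is to build the date-to-class lookup once and apply it with `Series.map`. The function name `map_dates_to_stock_change` and the toy frames are assumptions for illustration, and the sketch assumes each `Date` appears only once in the target.

```python
import datetime as dt
import pandas as pd

def map_dates_to_stock_change(features_df, target_df, missing='x'):
    """Look up each feature row's Date in target_df; label unmatched dates with `missing`."""
    # One class per trading day, indexed by Date (assumes Dates are unique in target_df)
    class_by_date = target_df.set_index('Date')['Percent_Change_Class']
    # Series.map looks each Date up in that index; unmatched dates come back as NaN
    return features_df['Date'].map(class_by_date).fillna(missing)

# Toy stand-ins for tweets_df and SP500_df
toy_tweets = pd.DataFrame({
    'Date': [dt.date(2017, 5, 25), dt.date(2017, 5, 27), dt.date(2017, 5, 22)],
})
toy_stocks = pd.DataFrame({
    'Date': [dt.date(2017, 5, 22), dt.date(2017, 5, 25)],
    'Percent_Change_Class': ['neutral', 'down'],
})

toy_change = map_dates_to_stock_change(toy_tweets, toy_stocks)
print(toy_change.tolist())  # ['down', 'x', 'neutral']
```

The result is a Series that keeps the index of the feature frame, so it can be assigned back as a new column without a separate merge step.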


In [22]:
change_df[0].value_counts()

down       61050
x          17571
up         12549
neutral     3686
Name: 0, dtype: int64

In [23]:
61050+17571+12549+3686

94856
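The manual sum above can also be written as an assertion, so the check fails loudly if a Tweet ever ends up without a label. A small sketch using the counts printed above; the variable names are illustrative assumptions.

```python
import pandas as pd

# Label counts copied from the value_counts() output above
label_counts = pd.Series({'down': 61050, 'x': 17571, 'up': 12549, 'neutral': 3686})
n_tweets = 94856  # number of rows in tweets_df, from tweets_df.shape

total = int(label_counts.sum())
assert total == n_tweets, 'every Tweet should receive exactly one label'

# Tweets that matched a trading day (everything except the 'x' placeholder)
matched = total - int(label_counts['x'])
print(total, matched)  # 94856 77285
```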

**Now, we have our X's and y's**<br>
Our <u>Features</u> == Tweets (\_id, text, timestamp, username, cleaned_text, Date)<br>
Our <u>Target</u> == change in stock prices (up/down/neutral)<br>

`change_df` has one row per Tweet, in the same order as `tweets_df`, so the two can be joined on their index. All that remains is to drop the Tweets labeled `x`, which were posted on dates with no stock data.

**Merge, then drop Tweets posted on dates with no stock data**

In [24]:
combined_df_nodrop = tweets_df.merge(change_df, left_index=True, right_index=True)

In [25]:
combined_df_nodrop.head(5)

Unnamed: 0,_id,text,timestamp,username,Date,cleaned_text,0
0,593deb7057bbd4047664223e,"On this National Gun Violence Awareness Day, l...",2017-06-02 17:35:54,BarackObama,2017-06-02,on this national gun violence awareness day le...,down
1,593deb7057bbd4047664223f,Forever grateful for the service and sacrifice...,2017-05-29 13:09:16,BarackObama,2017-05-29,forever grateful for the service and sacrifice...,x
2,593deb7057bbd40476642240,Good to see my friend Prince Harry in London t...,2017-05-27 13:15:25,BarackObama,2017-05-27,good to see my friend prince harry in london t...,x
3,593deb7057bbd40476642241,"Through faith, love, and resolve the character...",2017-05-25 14:13:35,BarackObama,2017-05-25,through faith love and resolve the character o...,down
4,593deb7057bbd40476642242,Our hearts go out to those killed and wounded ...,2017-05-23 16:56:14,BarackObama,2017-05-23,our hearts go out to those killed and wounded ...,down


In [26]:
combined_df = combined_df_nodrop[combined_df_nodrop[0]!='x']

In [27]:
combined_df.head(5)

Unnamed: 0,_id,text,timestamp,username,Date,cleaned_text,0
0,593deb7057bbd4047664223e,"On this National Gun Violence Awareness Day, l...",2017-06-02 17:35:54,BarackObama,2017-06-02,on this national gun violence awareness day le...,down
3,593deb7057bbd40476642241,"Through faith, love, and resolve the character...",2017-05-25 14:13:35,BarackObama,2017-05-25,through faith love and resolve the character o...,down
4,593deb7057bbd40476642242,Our hearts go out to those killed and wounded ...,2017-05-23 16:56:14,BarackObama,2017-05-23,our hearts go out to those killed and wounded ...,down
5,593deb7057bbd40476642243,"Excited to hear from Sierra, Imani, Filiz, and...",2017-05-22 21:16:23,BarackObama,2017-05-22,excited to hear from sierra imani filiz and be...,neutral
7,593deb7057bbd40476642245,"We're rolling up our sleeves again, back where...",2017-05-03 19:42:16,BarackObama,2017-05-03,we re rolling up our sleeves again back where ...,down
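The label-then-filter steps above could also be expressed as a single inner merge on `Date`. This is a sketch of that alternative on toy data, not the approach the notebook takes; the toy frames and their values are assumptions for illustration.

```python
import datetime as dt
import pandas as pd

# Toy stand-ins for tweets_df and SP500_df
toy_tweets = pd.DataFrame({
    'text': ['a', 'b', 'c'],
    'Date': [dt.date(2017, 5, 25), dt.date(2017, 5, 27), dt.date(2017, 5, 22)],
})
toy_stocks = pd.DataFrame({
    'Date': [dt.date(2017, 5, 22), dt.date(2017, 5, 25)],
    'Percent_Change_Class': ['neutral', 'down'],
})

# how='inner' keeps only Tweets whose Date also appears in the stock data;
# validate='many_to_one' raises if a Date is duplicated on the stock side
toy_combined = toy_tweets.merge(
    toy_stocks[['Date', 'Percent_Change_Class']],
    on='Date', how='inner', validate='many_to_one',
)
print(toy_combined)
```

One difference worth noting: the merge returns a fresh 0..n-1 index, while the boolean filter used above keeps the original row labels (0, 3, 4, 5, 7, ...).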


In [28]:
pd.to_pickle(combined_df, '../Analyzing_Unstructured_Data_for_Finance/data/4.2.combined_df_SP500.pickle')

**Now split the data into X's and y's - our features and target**

In [29]:
X = combined_df.drop(0, axis=1)
y = combined_df[0]

In [30]:
print(X.shape)
print(y.shape)

(77285, 6)
(77285,)


In [31]:
77285.0/94856.0
# We kept about 81% of the original Tweets after dropping those with no matching stock date

0.8147613224255714

In [32]:
pd.to_pickle(X, '../Analyzing_Unstructured_Data_for_Finance/data/4.2.X.pickle')

In [33]:
pd.to_pickle(y, '../Analyzing_Unstructured_Data_for_Finance/data/4.2.y_SP500.pickle')

In [34]:
combined_df[0].value_counts()

down       61050
up         12549
neutral     3686
Name: 0, dtype: int64
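The raw counts are easier to compare as shares of the total. A short sketch using the numbers printed above; the variable names are illustrative assumptions.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
class_counts = pd.Series({'down': 61050, 'up': 12549, 'neutral': 3686})

# Share of each class among the 77285 remaining Tweets
class_share = class_counts / class_counts.sum()
print(class_share.round(3))

# Share of the most frequent class
majority_share = float(class_share.max())
```

On the actual target Series, `y.value_counts(normalize=True)` should give the same shares directly.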

In [38]:
X_4_1 = pd.read_pickle('../Analyzing_Unstructured_Data_for_Finance/data/4.1.X.pickle')

In [39]:
X_4_2 = pd.read_pickle('../Analyzing_Unstructured_Data_for_Finance/data/4.2.X.pickle')

In [42]:
X_4_1.shape

(77283, 6)

In [43]:
X_4_2.shape

(77285, 6)
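The two shapes differ by two rows. One way to find which rows account for the gap is to compare the index labels of the two frames. This sketch assumes both frames still carry the original `tweets_df` row labels, which holds for `X_4_2` as built above but is an assumption for `X_4_1`; the toy frames stand in for the real ones.

```python
import pandas as pd

# Toy stand-ins for the two feature frames loaded from the pickles above
toy_X_4_1 = pd.DataFrame({'text': ['a', 'c']}, index=[0, 3])
toy_X_4_2 = pd.DataFrame({'text': ['a', 'b', 'c', 'd']}, index=[0, 1, 3, 4])

# Index labels present in one frame but not in the other
only_in_4_2 = toy_X_4_2.index.difference(toy_X_4_1.index)
only_in_4_1 = toy_X_4_1.index.difference(toy_X_4_2.index)
print(list(only_in_4_2), list(only_in_4_1))  # [1, 4] []

# The rows themselves can then be inspected with .loc
extra_rows = toy_X_4_2.loc[only_in_4_2]
```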