# Part 4 -- Combine Features and Target (AAPL)

Our feature data (Tweets) and target data (stock prices) come from different sources and have different numbers of rows and columns, so we need to align them until our X's and y's share the same number of rows. This matters because a model expects exactly one target value for every feature row.

**Load library code**

In [1]:
from os import chdir
chdir('/home/jovyan/work/Portfolio/Analyzing_Unstructured_Data_for_Finance/')

from lib import *
# suppress_warnings()

In [2]:
AAPL_df = pd.read_pickle('../Analyzing_Unstructured_Data_for_Finance/data/3.1.AAPL_df.pickle')

In [3]:
tweets_df = pd.read_pickle('../Analyzing_Unstructured_Data_for_Finance/data/2.tweets_df.pickle')

In [5]:
AAPL_df.shape

(2013, 9)

In [6]:
tweets_df.shape

(94856, 6)

**Check that the dates in both DataFrames match before merging**

In [11]:
AAPL_df[2000:2005]

Unnamed: 0,Date,Open,High,Low,Close,Volume,Diff,Percent_Change,Percent_Change_Class
2000,2017-05-23,154.9,154.9,153.31,153.8,19918871,-0.19,-0.001235,down
2001,2017-05-24,153.84,154.17,152.67,153.34,19219154,-0.46,-0.003,down
2002,2017-05-25,153.73,154.35,153.03,153.87,19235598,0.53,0.003444,down
2003,2017-05-26,154.0,154.24,153.31,153.61,21927637,-0.26,-0.001693,down
2004,2017-05-30,153.42,154.43,153.33,153.67,20126851,0.06,0.00039,down


In [12]:
tweets_df[:5]

Unnamed: 0,_id,text,timestamp,username,Date,cleaned_text
0,593deb7057bbd4047664223e,"On this National Gun Violence Awareness Day, l...",2017-06-02 17:35:54,BarackObama,2017-06-02,on this national gun violence awareness day le...
1,593deb7057bbd4047664223f,Forever grateful for the service and sacrifice...,2017-05-29 13:09:16,BarackObama,2017-05-29,forever grateful for the service and sacrifice...
2,593deb7057bbd40476642240,Good to see my friend Prince Harry in London t...,2017-05-27 13:15:25,BarackObama,2017-05-27,good to see my friend prince harry in london t...
3,593deb7057bbd40476642241,"Through faith, love, and resolve the character...",2017-05-25 14:13:35,BarackObama,2017-05-25,through faith love and resolve the character o...
4,593deb7057bbd40476642242,Our hearts go out to those killed and wounded ...,2017-05-23 16:56:14,BarackObama,2017-05-23,our hearts go out to those killed and wounded ...


In [10]:
tweets_df['Date'][3]==AAPL_df['Date'][2002]

True
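
The equality check above only succeeds when both `Date` columns hold the same kind of object. Below is a minimal sketch on toy data (the frames and values are made up for illustration, not taken from the notebook) of how mixed types can silently fail to match, and how converting both sides to `datetime.date` avoids that:

```python
import datetime as dt
import pandas as pd

# Toy stand-ins for the two frames (hypothetical values).
stocks = pd.DataFrame({'Date': [dt.date(2017, 5, 25)]})
tweets = pd.DataFrame({'Date': ['2017-05-25']})  # dates stored as strings

# A string never compares equal to a datetime.date, so the naive check fails.
naive_match = tweets['Date'][0] == stocks['Date'][0]

# Parse the strings and keep only the calendar date.
tweets['Date'] = pd.to_datetime(tweets['Date']).dt.date
normalised_match = tweets['Date'][0] == stocks['Date'][0]

print(naive_match, normalised_match)
```

If the two columns already share a type, as they appear to here, no conversion is needed.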

In [13]:
AAPL_df[AAPL_df['Percent_Change_Class']=='neutral']

Unnamed: 0,Date,Open,High,Low,Close,Volume,Diff,Percent_Change,Percent_Change_Class
18,2009-07-08,19.42,19.72,19.20,19.60,143982048,0.26,0.013265,neutral
26,2009-07-20,21.90,22.15,21.56,21.84,183881187,0.16,0.007326,neutral
29,2009-07-23,22.38,22.63,22.22,22.55,131740378,0.16,0.007095,neutral
30,2009-07-24,22.42,22.86,22.36,22.86,109589914,0.31,0.013561,neutral
40,2009-08-07,23.64,23.80,23.54,23.64,96870928,0.22,0.009306,neutral
49,2009-08-20,23.57,23.82,23.52,23.76,85549730,0.25,0.010522,neutral
54,2009-08-27,24.11,24.22,23.55,24.21,112294826,0.29,0.011979,neutral
59,2009-09-03,23.78,23.87,23.57,23.79,73525711,0.19,0.007987,neutral
61,2009-09-08,24.71,24.73,24.57,24.70,78761627,0.37,0.014980,neutral
63,2009-09-10,24.58,24.75,24.40,24.65,122783346,0.20,0.008114,neutral


In [15]:
AAPL_df['Percent_Change_Class'].loc[AAPL_df['Date']==dt.date(2009, 7, 8)].values

array(['neutral'], dtype=object)

In [24]:
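def match_dates_and_get_stock_change(features_df, target_df):
    """Return one stock-change label per row of features_df, matched on Date.

    Rows whose Date has no match in target_df get the placeholder 'x'.
    """
    change = []
    target_dates = list(target_df['Date'])  # build the list of stock dates once

    for i in features_df['Date']:
        try:
            if i in target_dates:
                # Label of the stock row that shares this tweet's date
                change.append(target_df['Percent_Change_Class'].loc[target_df['Date']==i].values)
            else:
                # No stock data for this date
                change.append('x')
        except Exception as e:
            print('Error:', e)

    return pd.DataFrame(change)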

In [25]:
start = datetime.now()

change_df = match_dates_and_get_stock_change(tweets_df, AAPL_df)

end = datetime.now()
print(end - start)

0:01:03.680343
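
The loop above takes about a minute because it searches the stock dates once per tweet. As a possible alternative, here is a sketch of the same lookup done with `Series.map`. It runs on toy data with hypothetical values and assumes each date appears only once in the stock frame:

```python
import datetime as dt
import pandas as pd

# Toy stand-ins for tweets_df and AAPL_df (hypothetical values).
features_df = pd.DataFrame({'Date': [dt.date(2017, 5, 25),
                                     dt.date(2017, 5, 27),   # a Saturday, no trading
                                     dt.date(2017, 5, 23)]})
target_df = pd.DataFrame({'Date': [dt.date(2017, 5, 23), dt.date(2017, 5, 25)],
                          'Percent_Change_Class': ['down', 'down']})

# Build a Date -> class lookup once, then map every tweet date through it.
# Dates missing from the lookup come back as NaN, which we replace with 'x'.
lookup = target_df.set_index('Date')['Percent_Change_Class']
change = features_df['Date'].map(lookup).fillna('x')

print(change.tolist())
```

The result is a Series of plain strings in the same order as the tweets, so it could be assigned as a new column instead of being merged by index.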


In [26]:
change_df[0].value_counts()

down       54061
x          17573
neutral    16515
up          6707
Name: 0, dtype: int64

In [19]:
54061+17573+16515+6707

94856
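
Rather than adding the counts by hand, the same sanity check can be written as assertions. A small sketch on toy data (the two frames below are hypothetical stand-ins for `tweets_df` and `change_df`):

```python
import pandas as pd

# Toy stand-ins (hypothetical values) for tweets_df and change_df.
tweets_df = pd.DataFrame({'text': ['a', 'b', 'c', 'd']})
change_df = pd.DataFrame(['down', 'x', 'neutral', 'down'])

counts = change_df[0].value_counts()

# Every tweet should have received exactly one label.
assert counts.sum() == len(tweets_df)
# Both frames should share the same index, since the merge is done on the index.
assert change_df.index.equals(tweets_df.index)
```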

**Now, we have our X's and y's**<br>
Our <u>Features</u> == Tweets (\_id, text, timestamp, username, cleaned_text, Date)<br>
Our <u>Target</u> == change in stock prices (up/down/neutral)<br>

Both have the same number of rows, in the same order, so our X's and y's line up. Now we just need to drop the rows marked `x`, where a Tweet's Date has no matching stock data (for example, weekends and market holidays).

**Merge & drop rows where there's no stock data for Tweets**

In [27]:
combined_df_nodrop = tweets_df.merge(change_df, left_index=True, right_index=True)

In [31]:
combined_df_nodrop.head(5)

Unnamed: 0,_id,text,timestamp,username,Date,cleaned_text,0
0,593deb7057bbd4047664223e,"On this National Gun Violence Awareness Day, l...",2017-06-02 17:35:54,BarackObama,2017-06-02,on this national gun violence awareness day le...,neutral
1,593deb7057bbd4047664223f,Forever grateful for the service and sacrifice...,2017-05-29 13:09:16,BarackObama,2017-05-29,forever grateful for the service and sacrifice...,x
2,593deb7057bbd40476642240,Good to see my friend Prince Harry in London t...,2017-05-27 13:15:25,BarackObama,2017-05-27,good to see my friend prince harry in london t...,x
3,593deb7057bbd40476642241,"Through faith, love, and resolve the character...",2017-05-25 14:13:35,BarackObama,2017-05-25,through faith love and resolve the character o...,down
4,593deb7057bbd40476642242,Our hearts go out to those killed and wounded ...,2017-05-23 16:56:14,BarackObama,2017-05-23,our hearts go out to those killed and wounded ...,down


In [29]:
combined_df = combined_df_nodrop[combined_df_nodrop[0]!='x']
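
The merge and filter above could also be written as a single inner join on `Date`. This is only a sketch on toy data with hypothetical values, and it assumes the stock frame has one row per date:

```python
import datetime as dt
import pandas as pd

# Toy stand-ins for tweets_df and AAPL_df (hypothetical values).
tweets = pd.DataFrame({
    'text': ['t1', 't2', 't3'],
    'Date': [dt.date(2017, 6, 2), dt.date(2017, 5, 29), dt.date(2017, 5, 25)],
})
stocks = pd.DataFrame({
    'Date': [dt.date(2017, 5, 25), dt.date(2017, 6, 2)],
    'Percent_Change_Class': ['down', 'neutral'],
})

# how='inner' keeps only tweets whose Date also appears in the stock data;
# validate='many_to_one' raises an error if a stock date is duplicated.
combined = tweets.merge(stocks[['Date', 'Percent_Change_Class']],
                        on='Date', how='inner', validate='many_to_one')
print(combined)
```

One difference from the approach above: the join returns a fresh index, so the original tweet row numbers are not preserved.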

In [32]:
combined_df.head(5)

Unnamed: 0,_id,text,timestamp,username,Date,cleaned_text,0
0,593deb7057bbd4047664223e,"On this National Gun Violence Awareness Day, l...",2017-06-02 17:35:54,BarackObama,2017-06-02,on this national gun violence awareness day le...,neutral
3,593deb7057bbd40476642241,"Through faith, love, and resolve the character...",2017-05-25 14:13:35,BarackObama,2017-05-25,through faith love and resolve the character o...,down
4,593deb7057bbd40476642242,Our hearts go out to those killed and wounded ...,2017-05-23 16:56:14,BarackObama,2017-05-23,our hearts go out to those killed and wounded ...,down
5,593deb7057bbd40476642243,"Excited to hear from Sierra, Imani, Filiz, and...",2017-05-22 21:16:23,BarackObama,2017-05-22,excited to hear from sierra imani filiz and be...,neutral
7,593deb7057bbd40476642245,"We're rolling up our sleeves again, back where...",2017-05-03 19:42:16,BarackObama,2017-05-03,we re rolling up our sleeves again back where ...,down


In [33]:
pd.to_pickle(combined_df, '../Analyzing_Unstructured_Data_for_Finance/data/4.1.combined_df_AAPL.pickle')

**Now split the data into X's and y's - our features and target**

In [34]:
X = combined_df.drop(0, axis=1)
y = combined_df[0]
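
The label column is called `0` because it came from an unnamed DataFrame. Here is a sketch, on toy data, of giving it a descriptive name before splitting. The name `AAPL_change` is hypothetical and not used elsewhere in this notebook:

```python
import pandas as pd

# Toy stand-in for combined_df (hypothetical values).
combined_df = pd.DataFrame({'text': ['t1', 't2'], 0: ['down', 'neutral']})

# Rename the label column, then split into features and target.
labelled = combined_df.rename(columns={0: 'AAPL_change'})
X = labelled.drop(columns='AAPL_change')
y = labelled['AAPL_change']

print(X.shape, y.shape)
```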

In [35]:
print(X.shape)
print(y.shape)

(77283, 6)
(77283,)


In [36]:
77283.0/94856.0
# We kept about 81% of the original tweets after dropping those with no matching stock date

0.8147402378341908

In [37]:
pd.to_pickle(X, '../Analyzing_Unstructured_Data_for_Finance/data/4.1.X.pickle')

In [38]:
pd.to_pickle(y, '../Analyzing_Unstructured_Data_for_Finance/data/4.1.y_AAPL.pickle')

In [39]:
combined_df[0].value_counts()

down       54061
neutral    16515
up          6707
Name: 0, dtype: int64
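
To read these counts as proportions, they can be divided by their total. A small sketch using the numbers printed above; on the real data, `combined_df[0].value_counts(normalize=True)` should give the same shares:

```python
import pandas as pd

# Label counts copied from the value_counts() output above.
counts = pd.Series({'down': 54061, 'neutral': 16515, 'up': 6707})

# Share of each class among the retained tweets.
shares = counts / counts.sum()
print(shares.round(3))
```

Roughly 70% of the retained tweets fall in the `down` class, with about 21% `neutral` and 9% `up`.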