
<a id="overview"></a>

# Predicting Stock Price Movement


## Overview
It's your first day at RedBoulder, the globe's preeminent financial institution, and you have been assigned to design an algorithm that predicts the price movement of [Twilio, Inc.](https://finance.yahoo.com/quote/TWLO/) stock. You are given two datasets:

1. Price / volume data for the month of August 2019
2. Twitter data for the month of August 2019. This dataset includes every single tweet posted in that month that includes the cashtag `$TWLO`

In this project, you will import, clean, visualize, and analyze market data for Twilio, Inc. (TWLO). Your goal is to predict how the price is likely to move over short intervals. To do so, you'll compare market data to trends drawn from sentiment data collected on Twitter.


**Expected Time to complete: 5-10 hours**

## Objectives

This assignment will provide you with a chance to:

- Use `pandas` to read in CSV files.
- Practice core `pandas` functionality such as `merge`, filtering, etc.
- Graph trends to determine insights with `matplotlib`.
- Clean and structure stock price data to forecast price movement.


## Problem

Your task is to set up the data and ascertain signals that correlate to stock price movement. You'll be answering questions, such as:

- What is the volatility of our data within 30 minute windows?
- Can we predict price movement within 2 hours with a degree of accuracy better than random chance?

> Hint: This assignment works extensively with financial terminology. If you need a refresher, [click here](https://www.investopedia.com/insights/digging-deeper-bull-and-bear-markets/).

## Structure

This notebook walks through Pythonic data analysis in different stages: 

- **Required:** This section covers classroom topics from recent units. These are _required_. 
- **Advanced:** This section covers upcoming topics. These are _optional_.

Throughout the notebook, you will see clearly labeled sections. _You must provide answers to all of the questions in the **Required** section._ Some questions have been further divided into "Part 1", "Part 2", etc. in order to break down the steps of sequential logic used in Python programming. Please attempt answers for all parts.

For those of you who wish to work ahead or want to come back later for more practice, the **Advanced** section offers additional prompts that will extend your analysis. This section is optional; you do not need to complete these for submission; however, depending on the discretion of your section instructor, these questions may be completed for bonus points.


## Instructions

1. Open the assignment notebook. 
1. Save a copy of your notebook and retitle it: "yourname_assignment.ipynb"
1. Attempt answers for all **Required questions**. Some questions can be solved in many different ways!
1. Include at least one comment per question explaining your logic or approach. To include a comment in your Python code, use the `#` sign.
1. Make sure to include all work within your Jupyter notebook.
1. Submit answers for the **Required questions** to your instructional team by the due date.
1. Have fun!

## Data

Our dataset includes two CSV files, both in the `data/` directory: `twlo.csv` and `twlo_tweets.csv`.

- `twlo.csv` includes price / volume data for the month of August 2019.
- `twlo_tweets.csv` includes Twitter data for the month of August 2019, pulling in every single tweet posted that month which included the cashtag `$TWLO`.


1. Within `twlo.csv`, you'll see the following columns:

    - `date` - the date of the stock market data point (by the minute)
    - `close` - the closing price for the minute
    - `high` - high for the minute
    - `low` - low for the minute
    - `open` - opening price for the minute
    - `volume` - trading volume at that minute
    

2. In `twlo_tweets.csv`, you'll find:

    - `text` - the text of each tweet containing the tag `$TWLO`
    - `tweet_unique_id` - Twitter's unique id for the tweet
    - `date_tweeted` - the date that tweet was posted
    - `author_handle` - handle for the author
    - `author_id` - unique id for the author
    - `author_verified` - boolean, whether or not the author is "Twitter verified"
    - `num_followers` - How many followers the author has
    - `num_following` - How many people the author follows
  

---
### Setting up our Environment

In [2]:
# Import our libraries for data manipulation and plotting:

import pandas as pd
from matplotlib import style

import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
data = pd.read_csv('data/twlo.csv')

------------------

# REQUIRED / GRADED
> **Required:** This section covers topics from recent units and is _required_. 

Begin by exploring the data with Pandas. 

Ready, set, go!


---

## Question 1

Change the index to be the `date` column [and use `tz_convert`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.tz_convert.html) to convert the dates to Eastern timezone (they are currently in UTC).
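As a rough sketch of the pattern on made-up data (the timestamps and prices below are invented, not taken from `twlo.csv`), parsing the strings with `utc=True` gives a timezone-aware index that `tz_convert` can then move to Eastern time. The zone name `"America/New_York"` is an assumption about which Eastern zone is wanted.

```python
import pandas as pd

# Toy frame standing in for the price data; all values are invented.
toy = pd.DataFrame({
    "date": ["2019-08-01 13:30:00+00:00",
             "2019-08-01 13:31:00+00:00",
             "2019-08-01 13:32:00+00:00"],
    "close": [100.0, 100.5, 100.2],
})

# Parse the strings as UTC timestamps, then make them the index.
toy["date"] = pd.to_datetime(toy["date"], utc=True)
toy = toy.set_index("date")

# Convert the timezone-aware index from UTC to Eastern time.
toy.index = toy.index.tz_convert("America/New_York")
print(toy)
```

Note that `tz_convert` only works on timestamps that already carry a timezone; timezone-naive timestamps would need `tz_localize` first.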

In [3]:
# Now you try!
# Enter your solution for Q1

---

## Question 2

Use pandas plot to plot the closing price as a line graph over the entire month of August. Include a title for the graph.

In [5]:
# Now you try!
# Enter your solution for Q2

---

## Question 3

Use Matplotlib directly (not pandas' built-in plotting) to plot two graphs, one on top of the other:

- **Part 1**. The top graph should be the closing price plotted over the entire month (same as the last question)
- **Part 2**. The bottom graph should be the volume as a bar chart

> Hint: If you want to customize your plots, [check out the documentation for `style.use`](https://matplotlib.org/3.1.1/api/style_api.html?highlight=style%20use#matplotlib.style.use). [Click here for further inspiration](https://matplotlib.org/3.1.1/gallery/style_sheets/style_sheets_reference.html).
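One way to stack two plots is `plt.subplots` with two rows. The sketch below uses invented minute-level data rather than the TWLO prices, and the names `toy`, `ax_price`, and `ax_volume` are illustrative only.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented minute-level data; not the TWLO prices.
idx = pd.date_range("2019-08-01 09:30", periods=5, freq="min")
toy = pd.DataFrame({"close": [100.0, 100.4, 100.1, 100.8, 100.6],
                    "volume": [500, 300, 450, 700, 200]}, index=idx)

# Two rows, one column, sharing the x axis so the timestamps line up.
fig, (ax_price, ax_volume) = plt.subplots(2, 1, sharex=True, figsize=(8, 6))

# Top panel: closing price as a line.
ax_price.plot(toy.index, toy["close"])
ax_price.set_title("Toy closing price")

# Bottom panel: volume as bars. On a date axis the bar width is measured
# in days, so one minute is 1/1440 of a day.
ax_volume.bar(toy.index, toy["volume"], width=1 / 1440)
ax_volume.set_title("Toy volume")

fig.tight_layout()
```

Sharing the x axis is a design choice, not a requirement: it keeps the price and volume panels aligned when you zoom or change the date range.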

In [7]:
# Now you try!
# Enter your solution for Q3, Parts 1 and 2

---

## Question 4

Use pandas' [rolling method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html) to calculate the rolling standard deviation and rolling mean with 30 minute windows for the "close" and "volume" columns.

> Hint: Some cells will be empty because there isn't any data from the last 30 minutes for early trading minutes. This is ok!
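To see why the first rows come back empty, here is a small sketch on invented numbers with a window of 3 rows instead of 30. It assumes an integer window, which counts rows; that matches minutes only if the data has one row per minute.

```python
import pandas as pd

idx = pd.date_range("2019-08-01 09:30", periods=6, freq="min")
toy = pd.DataFrame({"close": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]}, index=idx)

# With a window of 3 rows, the first 2 rows lack a full window and are NaN.
rolling_mean = toy["close"].rolling(window=3).mean()
rolling_std = toy["close"].rolling(window=3).std()

print(rolling_mean)
print(rolling_std)
```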

In [9]:
# Now you try!
# Enter your solution for Q4

---

### Tutorial

The **[Coefficient of Variation](https://en.wikipedia.org/wiki/Coefficient_of_variation)** provides a standardized measure of dispersion that gives some insight into the volatility of day to day trading.

---

## Question 5

Calculate the rolling "Coefficient of Variation" (calculated as the `rolling std / rolling mean`) with a window of 60 minutes for each column.

> Remember, some cells will be empty because the earliest trading minutes don't yet have a full 60 minutes of data behind them. This is ok!
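The ratio itself can be sketched on a handful of invented prices. The window of 3 rows below is only for readability and stands in for the 60-minute window.

```python
import pandas as pd

# Invented prices; a window of 3 rows stands in for the 60-minute window.
prices = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0])

window = prices.rolling(window=3)

# Coefficient of variation: rolling standard deviation divided by rolling mean.
rolling_cv = window.std() / window.mean()
print(rolling_cv)
```

Because the standard deviation is divided by the mean, the result is unitless, which makes columns on very different scales (such as price and volume) easier to compare.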

In [13]:
# Now you try!
# Enter your solution for Q5

---

### Tutorial

In [15]:
# Great, now let's import the twitter data!

tweets = pd.read_csv('data/twlo_tweets.csv')

tweets.head()

| | text | tweet_unique_id | author_handle | author_id | author_verified | date_tweeted | num_followers | num_following |
|---|---|---|---|---|---|---|---|---|
| 0 | RT @HedgeMind: $TWLO's hype growth continues, ... | 1156719610433015808 | AshwinMaddi | 57757815 | False | 2019-08-01 00:13:40+00:00 | 191 | 548 |
| 1 | Blog Post: "Burn It to the Ground"\n\nhttps://... | 1156719981083729923 | JohnBonini | 23892061 | False | 2019-08-01 00:15:08+00:00 | 390 | 688 |
| 2 | $TWLO Twilio Inc. Class A Commo #LeaderPullBac... | 1156721339325177856 | stockmktgenius | 914214567152160768 | False | 2019-08-01 00:20:32+00:00 | 597 | 58 |
| 3 | Jeff Lawson (Twilio $TWLO Founder &amp; CEO) i... | 1156727135530360834 | newmoneyFC | 3278201701 | False | 2019-08-01 00:43:34+00:00 | 1437 | 201 |
| 4 | RT @EvanKenty: $TWLO Regained all AH drop, sho... | 1156727777195958280 | j_p_jacques | 31173813 | False | 2019-08-01 00:46:07+00:00 | 1344 | 1780 |


---

## Question 6

Change the index of `tweets` to be the `date_tweeted` column and use [tz_convert](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.tz_convert.html) to convert the dates to Eastern timezone (they are currently all in UTC)

In [16]:
# Now you try!
# Enter your solution for Q6

---

## Question 7

- **Part 1**: Show the tweets whose text contains the word `bull`
- **Part 2**: Create a bar chart of the number of tweets posted in each hour of the day

> Hint: There should be 24 bars, each bar showing how many tweets came out at that hour over the entire month.
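Both parts can be sketched on three invented tweets. The case-insensitive match and the `reindex` step (which pads hours that have no tweets) are assumptions about what the question wants, not requirements stated above.

```python
import pandas as pd

# Three invented tweets with invented timestamps.
toy = pd.DataFrame(
    {"text": ["Feeling bullish on $TWLO",
              "$TWLO earnings tonight",
              "BULL run for $TWLO?"]},
    index=pd.to_datetime(["2019-08-01 09:15",
                          "2019-08-01 09:45",
                          "2019-08-02 14:05"]),
)

# Case-insensitive substring match on the tweet text.
has_bull = toy[toy["text"].str.contains("bull", case=False)]

# Count rows per hour of day, then pad missing hours so all 24 appear.
per_hour = toy.groupby(toy.index.hour).size().reindex(range(24), fill_value=0)

print(has_bull)
print(per_hour[per_hour > 0])
```

A substring match also catches words such as "bullish". Calling `per_hour.plot(kind="bar")` would then draw one bar per hour.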

In [18]:
# Now you try!
# Enter your solution for Q7, Part 1

In [19]:
# Now you try!
# Enter your solution for Q7, Part 2

------------------

# ADVANCED 

> **Advanced:** This section covers topics from upcoming units. These questions are _optional_. 

So far, you've learned to work with Pandas to analyze data... but there's still a lot more to be done!

The following questions are NOT required for submission; however, they will help you expand your analysis. In the next section, we'll deepen our analysis by comparing training and test data and then building a decision tree classifier.

> Hint: If you feel like proceeding, we recommend that you spend some time with the documentation for [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html).

---

### Tutorial

In [22]:
# Let's add a column called `minute`, which will be the `date_tweeted` column rounded to the nearest minute:

tweets['minute'] = tweets.index.round(freq='1min')
tweets.head(2)

| date_tweeted | text | tweet_unique_id | author_handle | author_id | author_verified | num_followers | num_following | minute |
|---|---|---|---|---|---|---|---|---|
| 2019-07-31 20:13:40-04:00 | RT @HedgeMind: $TWLO's hype growth continues, ... | 1156719610433015808 | AshwinMaddi | 57757815 | False | 191 | 548 | 2019-07-31 20:14:00-04:00 |
| 2019-07-31 20:15:08-04:00 | Blog Post: "Burn It to the Ground"\n\nhttps://... | 1156719981083729923 | JohnBonini | 23892061 | False | 390 | 688 | 2019-07-31 20:15:00-04:00 |


In [23]:
# Now we'll create a DataFrame that groups the "tweets" DataFrame by the minute, then counts the number of tweets, 
# and sums the number of people (collectively) who were the audience for tweets at that minute!

tweet_count_by_minute = tweets.groupby('minute').agg({'num_followers': 'sum', 'tweet_unique_id': 'count'})

tweet_count_by_minute.head()

| minute | num_followers | tweet_unique_id |
|---|---|---|
| 2019-07-31 20:14:00-04:00 | 191 | 1 |
| 2019-07-31 20:15:00-04:00 | 390 | 1 |
| 2019-07-31 20:21:00-04:00 | 597 | 1 |
| 2019-07-31 20:44:00-04:00 | 1437 | 1 |
| 2019-07-31 20:46:00-04:00 | 1344 | 1 |


---

## Question 8

- **Part 1** - Merge the `tweet_count_by_minute` DataFrame into our working data using `pandas.merge`, [making sure to fill any missing data with a 0](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html).

- **Part 2** - Next, [use pandas.shift](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shift.html) to add a new column called `price_change_in_two_hours`, which should hold the difference between the closing price 120 minutes after the given row and the closing price of the given row.
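The two steps can be sketched on five invented rows. Here a shift of 2 rows stands in for 120, and the column name `change_in_two_rows` is hypothetical. The sketch assumes both frames are indexed by the same minute timestamps.

```python
import pandas as pd

idx = pd.date_range("2019-08-01 09:30", periods=5, freq="min")
prices = pd.DataFrame({"close": [10.0, 10.5, 10.2, 10.9, 11.0]}, index=idx)

# Tweets exist for only two of the five minutes.
tweet_counts = pd.DataFrame({"tweet_unique_id": [2, 1]}, index=idx[[1, 3]])

# A left merge on the index keeps every price row; minutes without
# tweets come back as NaN and are then filled with 0.
merged = pd.merge(prices, tweet_counts, how="left",
                  left_index=True, right_index=True).fillna(0)

# shift(-2) pulls the value from 2 rows later up to the current row,
# so the difference is "future close minus current close".
merged["change_in_two_rows"] = merged["close"].shift(-2) - merged["close"]
print(merged)
```

`shift` moves by rows, not by clock time, so it equals a fixed number of minutes only where the rows are consecutive minutes. The last rows have no future value and stay `NaN`.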

In [24]:
# Now you try!
# Enter your solution for Q8, Part 1

In [25]:
# Now you try!
# Enter your solution for Q8, Part 2

---

## Question 9

Now let's add a column to `data` called `will_go_up_in_two_hours_category` that should be either: 
- `Bearish` (if `price_change_in_two_hours < 0`) or 
- `Bullish` (if `price_change_in_two_hours > 0`)


When you're done, the value counts of the column should read:

```

>>> data['will_go_up_in_two_hours_category'].value_counts()

Bearish    2868
Bullish    2790
Name: will_go_up_in_two_hours_category, dtype: int64

```
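One of the many possible approaches is a small labeling function applied to each value. The sketch below uses invented numbers and hypothetical names (`label_direction`, `price_change`, `category`). It leaves rows with a zero or missing change unlabeled, which is an assumption about how those rows should be treated.

```python
import numpy as np
import pandas as pd

# Invented price changes, including a zero and a missing value.
toy = pd.DataFrame({"price_change": [1.5, -0.3, 0.0, -2.0, np.nan]})


def label_direction(change):
    """Return 'Bullish' for a rise, 'Bearish' for a drop, else no label."""
    if change > 0:
        return "Bullish"
    if change < 0:
        return "Bearish"
    return None  # zero or NaN: neither comparison is true


toy["category"] = toy["price_change"].apply(label_direction)
print(toy["category"].value_counts())
```

`value_counts` skips missing labels by default, so only the `Bullish` and `Bearish` rows are counted.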

In [4]:
# Solution for Q9 (note: there are many different ways to do this!)


---

## Question 10

Create the following columns and add them to the `data` DataFrame:

1. `feature__30_min_rolling_close` - 30 min rolling mean for close
2. `feature__60_min_rolling_close` - 60 min rolling mean for close
3. `feature__60_min_rolling_volume` - 60 min rolling mean for volume
4. `feature__60_min_rolling_tweet_followers` - 60 min rolling mean for `num_followers`
5. `feature__60_min_rolling_tweets_count` - 60 min rolling mean for `tweet_unique_id`

Remember, our goal is to use these signals to predict the movement of the stock price at every given minute!

In [32]:
# Now you try!
# Enter your solution for Q10

---

### Tutorial

In [34]:
# Create a new DataFrame called `prediction_df` that only selects rows at the 30 minute mark of even hours:
prediction_df = data[(data.index.minute == 30) & (data.index.hour % 2 == 0)].copy()

# Remove rows with missing values:
prediction_df.dropna(inplace=True)

# Show the first five rows:
prediction_df.head()

| date | close | high | low | open | volume | num_followers | tweet_unique_id | price_change_in_two_hours | will_go_up_in_two_hours_category | feature__30_min_rolling_close | feature__60_min_rolling_close | feature__60_min_rolling_volume | feature__60_min_rolling_tweet_followers | feature__60_min_rolling_tweets_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2019-08-01 10:30:00-04:00 | 139.899994 | 140.490005 | 139.690002 | 140.380005 | 59803.0 | 0.0 | 0.0 | 3.009903 | Bullish | 137.881154 | 138.144339 | 35869.9 | 14810.333333 | 38.0 |
| 2019-08-01 12:30:00-04:00 | 142.909897 | 142.909897 | 142.664993 | 142.664993 | 6050.0 | 0.0 | 0.0 | -5.40979 | Bearish | 142.303178 | 142.602729 | 12391.633333 | 5207.733333 | 19.0 |
| 2019-08-02 10:30:00-04:00 | 135.009995 | 135.294601 | 135.0 | 135.294601 | 7083.0 | 0.0 | 0.0 | -0.939987 | Bearish | 135.645647 | 135.462765 | 12487.433333 | 57.416667 | 4.0 |
| 2019-08-02 12:30:00-04:00 | 134.070007 | 134.070007 | 134.005798 | 134.035202 | 3844.0 | 0.0 | 0.0 | -1.110001 | Bearish | 133.664 | 133.99713 | 9495.833333 | 0.0 | 0.0 |
| 2019-08-05 10:30:00-04:00 | 123.716003 | 123.769997 | 123.459999 | 123.769997 | 19254.0 | 0.0 | 0.0 | 0.6054 | Bullish | 124.556061 | 124.602685 | 32691.366667 | 746.033333 | 23.0 |


---

## Question 11

Now, create two new DataFrames, `training` and `testing` where:

1. `training` is every row in `prediction_df` with dates before 8/20/2019
1. `testing` is every row in `prediction_df` with dates on or after 8/20/2019

Our goal is to investigate how our signals interact with our prediction column (i.e. "Bullish" vs. "Bearish") in the `training` DataFrame, then see whether those hypotheses pan out on the `testing` set. This will test our algorithm's ability to work on data it hasn't seen before!
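A date-based split can be sketched with boolean masks on the index. The data below is invented, the names `toy_training` and `toy_testing` are hypothetical, and the cutoff is given the same timezone as the index on the assumption that the index is timezone-aware.

```python
import pandas as pd

# Four invented daily rows with a timezone-aware index.
idx = pd.date_range("2019-08-18", periods=4, freq="D", tz="America/New_York")
toy = pd.DataFrame({"close": [10.0, 11.0, 12.0, 13.0]}, index=idx)

# Make the cutoff timezone-aware so it compares cleanly with the index.
cutoff = pd.Timestamp("2019-08-20", tz="America/New_York")

toy_training = toy[toy.index < cutoff]   # rows before the cutoff
toy_testing = toy[toy.index >= cutoff]   # rows on or after the cutoff

print(len(toy_training), len(toy_testing))
```

Splitting by time rather than at random keeps every test row later than every training row, which mirrors how a forecast would actually be used.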

In [35]:
# Now you try!
# Enter your solution for Q11

---

### Tutorial

In [37]:
# Let's see which of our potential signals correlate to the column `price_change_in_two_hours`.

# `select_dtypes` keeps only the numeric columns, since `corr` cannot use the text label column.
# We'll then use pandas' `filter` (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.filter.html)
# to limit our selections only to the specific columns with the word "feature" in them:
training.select_dtypes('number').corr()['price_change_in_two_hours'].filter(regex='feature')

feature__30_min_rolling_close             -0.262944
feature__60_min_rolling_close             -0.266051
feature__60_min_rolling_volume             0.141402
feature__60_min_rolling_tweet_followers    0.136331
feature__60_min_rolling_tweets_count       0.106226
Name: price_change_in_two_hours, dtype: float64

In [50]:
# Interesting: `feature__60_min_rolling_close` and `feature__60_min_rolling_volume` show the strongest negative and
# positive correlations with `price_change_in_two_hours`, although both are fairly weak.

# Let's further isolate the training data into two different DataFrames; 
# one for "bullish" training points and another for "bearish" training points:

bullish = training[training['will_go_up_in_two_hours_category'] == 'Bullish'] 
bearish = training[training['will_go_up_in_two_hours_category'] == 'Bearish'] 


---

## Question 12

Use Matplotlib to create two scatter plots in the same figure, where:

1. The x axis is the `feature__60_min_rolling_close` column of the `bullish` and `bearish` DataFrames and the y axis is `feature__60_min_rolling_volume`.
2. Color the bullish points **green** and the bearish points **red**.
3. Size the dots so that the larger they are, the more the price moved (i.e. large red dots mean a large price drop and large green dots mean a large price rise).
4. Include a title and legend!

> Hint: For help creating legends, [check out Matplotlib's documentation](https://matplotlib.org/3.1.1/tutorials/intermediate/legend_guide.html).
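The general shape of such a figure can be sketched with two `scatter` calls on one axes. All values below are invented, the column names `x`, `y`, and `move` are placeholders for the two features and the price change, and the scale factor of 40 is an arbitrary choice.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented points; `move` stands in for the two-hour price change.
up = pd.DataFrame({"x": [1.0, 2.0], "y": [5.0, 6.0], "move": [0.5, 2.0]})
down = pd.DataFrame({"x": [1.5, 2.5], "y": [4.0, 7.0], "move": [-1.0, -3.0]})

fig, ax = plt.subplots()

# Marker area cannot be negative, so scale the absolute size of the move.
ax.scatter(up["x"], up["y"], s=up["move"].abs() * 40,
           c="green", label="Bullish")
ax.scatter(down["x"], down["y"], s=down["move"].abs() * 40,
           c="red", label="Bearish")

ax.set_title("Toy bullish vs. bearish points")
ax.legend()
```

Passing `label=` to each `scatter` call is what lets `ax.legend()` build its entries automatically.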

In [39]:
# Now you try!
# Enter your solution for Q12