# Project 2: Scraping FT.com for news on the US
## Part 2: Analysis
For the web scraping part, see Part 1 in the other file.

#### 1. Import the libraries we need.

In [38]:
import pandas as pd
from datetime import datetime
from yahoo_finance import Share
import collections
import numpy as np
from bokeh.io import output_notebook
from bokeh.models import HoverTool
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource
output_notebook()

#### 2. Load the scraped news data.

In [2]:
df = pd.read_csv('News_on_US.csv', index_col = None)

#### 3. Clean the news data for future analysis.

In [3]:
df1 = df.dropna()
df1.reset_index(inplace = True, drop = False)

In [4]:
df2 = df1.copy()
df2.Time = df2.Time.apply(lambda x: datetime.strptime(x, '%A, %d %B, %Y'))
df2.Time = df2.Time.apply(lambda x: x.date().strftime('%Y-%m-%d'))
df2.Title = df2.Title.apply(lambda x: repr(x))
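
As a quick check of the format string used above, the sketch below converts single dates the same way. The sample input strings are an assumption about what the scraped `Time` column contains, inferred from the `'%A, %d %B, %Y'` pattern, and `to_iso` is a hypothetical helper name. `%A` and `%B` are locale-dependent, so this assumes an English locale.

```python
from datetime import datetime

def to_iso(raw):
    """Convert e.g. 'Tuesday, 29 December, 2015' to '2015-12-29'."""
    parsed = datetime.strptime(raw, '%A, %d %B, %Y')
    return parsed.strftime('%Y-%m-%d')

# Assumed raw value; the real column may differ in detail.
iso = to_iso('Tuesday, 29 December, 2015')
```

Storing the dates as ISO strings matters later, because such strings sort in chronological order.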

Let's see what the data looks like.

In [5]:
df2.head()

| | index | Title | Time |
|---|---|---|---|
| 0 | 0 | 'Digital doctors: on a medical mission' | 2015-12-29 |
| 1 | 1 | 'Six Charts of Christmas: Valuation' | 2015-12-29 |
| 2 | 2 | 'FT person of the year \xe2\x80\x93 Angela Mer... | 2015-12-29 |
| 3 | 3 | 'Iraqi forces claim victory in Ramadi' | 2015-12-28 |
| 4 | 4 | 'Drones crash into regulatory thicket' | 2015-12-28 |


We record the date range here and create a daily `date_range` covering it, so that every calendar day appears even when no news was published.

In [6]:
date_lb = df2.Time.min()
date_ub = df2.Time.max()
print 'Date Range: %s to %s' % (date_lb, date_ub)
date_range_df = pd.date_range(start = date_lb, end = date_ub, freq = 'D')
date_range_lis = date_range_df.date.tolist()
total_date_lis = [str(x) for x in date_range_lis]

Date Range: 2006-06-28 to 2017-02-10
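
For readers who want to see what the daily range amounts to without pandas, here is a minimal standard-library sketch. `daily_dates` is a hypothetical helper and not part of the notebook; it only mirrors the idea of `pd.date_range(..., freq = 'D')` followed by conversion to strings.

```python
from datetime import date, timedelta

def daily_dates(start_iso, end_iso):
    """Return every date from start to end (inclusive) as 'YYYY-MM-DD' strings."""
    start = date(*map(int, start_iso.split('-')))
    end = date(*map(int, end_iso.split('-')))
    n_days = (end - start).days + 1
    return [(start + timedelta(days=i)).isoformat() for i in range(n_days)]

dates = daily_dates('2006-06-28', '2017-02-10')
```

With the bounds printed above, the list has 3881 entries, which is the index length reported in step 6.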


Now we create a dictionary where the keys are the dates in the range above and the values are the news titles for each day, joined into a single string with '|||' as the separator. Note that when no news titles are found for a date, we assign that date the string 'No News Today'.

In [7]:
temp_dict = dict(list(df2.groupby(df2.Time)))

In [8]:
news_dict = {}
for date in total_date_lis:
    if date in temp_dict:
        news_dict[date] = '|||'.join(temp_dict[date].Title.tolist())
    else:
        news_dict[date] = 'No News Today'
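
The same grouping-with-placeholder logic can be sketched without pandas. This is only an illustration under stated assumptions: `group_titles` is a hypothetical helper, and the two sample titles are a small subset of the 2006-06-30 headlines that appear later in this notebook.

```python
from collections import defaultdict

def group_titles(rows, all_dates, sep='|||', placeholder='No News Today'):
    """rows: iterable of (date, title) pairs.

    Returns {date: titles joined by sep}, using placeholder for dates
    that have no titles at all.
    """
    by_date = defaultdict(list)
    for day, title in rows:
        by_date[day].append(title)
    return {day: sep.join(by_date[day]) if day in by_date else placeholder
            for day in all_dates}

rows = [('2006-06-30', 'Summers named University professor'),
        ('2006-06-30', 'Boeing in final deal over procurement scandals')]
grouped = group_titles(rows, ['2006-06-30', '2006-07-01'])

# The separator makes the join reversible as long as no title contains '|||'.
titles_back = grouped['2006-06-30'].split('|||')
```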

We check what the news_dict looks like.

In [43]:
news_dict.items()[:5]

[('2008-11-17',
  "'Lex: Crunch time in the Valley'|||'Obama and McCain unite to \\xe2\\x80\\x98fix country\\xe2\\x80\\x99'|||'Colleges suffer big fall in endowments'|||'Survey exposes depth of US woe'|||'Gideon Rachman: Is Obama a Middle East \\xe2\\x80\\x98splitter\\xe2\\x80\\x99?'|||'Burmese ruby ban likely to be undermined'|||'EU, China and US in toy safety accord'"),
 ('2008-11-16',
  "'Beijing pressures US over Taiwan arms deal'|||'Car union rules out concessions'|||'Summit opens door to wide reforms'|||'Kissinger backs Hillary Clinton as top US diplomat'"),
 ('2008-11-15',
  "'Clinton a contender for secretary of state'|||'Treasury attacked over $700bn bail-out'"),
 ('2008-11-14',
  "'Obama\\xe2\\x80\\x99s absence puts Bush in focus'|||'JC Penney urges action on consumer spending'|||'Obama camp hints at key role for Clinton'"),
 ('2008-11-13',
  "'US drops plan to buy toxic assets'|||'Medvedev says he will meet Obama \\xe2\\x80\\x98soon\\xe2\\x80\\x99'|||'Signs of turmoil grow a

Looks good, but it is unordered, as plain dictionaries are in this version of Python. For our plot, however, the entries have to be ordered by date. We do that by converting it into an 'OrderedDict' built from the sorted items.

In [10]:
od = collections.OrderedDict(sorted(news_dict.items()))
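
Sorting the items works here only because the keys are ISO-formatted strings ('YYYY-MM-DD'), which sort lexicographically in chronological order. The toy sketch below illustrates this; the values 'a', 'b', 'c' and the day-first strings are made up for the example.

```python
from collections import OrderedDict

# Toy dictionary with ISO date keys in arbitrary insertion order.
news = {'2008-11-17': 'c', '2006-06-28': 'a', '2006-06-29': 'b'}
ordered = OrderedDict(sorted(news.items()))
first_keys = list(ordered.keys())

# Counter-example: day-first strings do not sort chronologically.
day_first = sorted(['28-06-2006', '10-02-2017'])
```

In the counter-example the 2017 date sorts before the 2006 date, which is why the notebook converts every date to '%Y-%m-%d' before this step.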

Check what od looks like.

In [44]:
od.items()[:5]

[('2006-06-28',
  "'Adelphia deal approval may face condition'|||'Paulson approved as new Treasury secretary'|||'Fed expected to raise rates to 5.25%'|||'Texas map ruling spurs gerrymandering'|||'SEC \\xe2\\x80\\x98blocked Mack questioning in insider probe\\xe2\\x80\\x99'"),
 ('2006-06-29',
  "'Scrushy faces 30 years in prison'|||'Brussels warns over about-turn on travel to US'|||'Setback for Bush on Guant\\xc3\\xa1namo'|||'Bush enlists Koizumi on missiles stance'|||'Interim Doha deal proves elusive'|||'Seoul cajoles N Korea to return to talks'|||'Net neutrality suffers another defeat in Senate'"),
 ('2006-06-30',
  "'Summers named University professor'|||'Japan\\xe2\\x80\\x99s premier follows dream to Memphis'|||'Factors stall Bush in aim to close Guant\\xc3\\xa1namo'|||'Boeing in final deal over procurement scandals'|||'Mayor\\xe2\\x80\\x99s clout brings King papers to Atlanta'|||'Justices prove to be no lapdogs of Bush'|||'Lockerbie victims press US to keep squeeze on Libya'|||'Mark

Looks good. The plot only needs the news titles as a list ordered by date, so we build that list next.

#### 4. Create the list of news titles for our future plot.

In [12]:
# od is already sorted by date, so its values are in plotting order
news_lis = list(od.values())

In [45]:
news_lis[:5]

["'Adelphia deal approval may face condition'|||'Paulson approved as new Treasury secretary'|||'Fed expected to raise rates to 5.25%'|||'Texas map ruling spurs gerrymandering'|||'SEC \\xe2\\x80\\x98blocked Mack questioning in insider probe\\xe2\\x80\\x99'",
 "'Scrushy faces 30 years in prison'|||'Brussels warns over about-turn on travel to US'|||'Setback for Bush on Guant\\xc3\\xa1namo'|||'Bush enlists Koizumi on missiles stance'|||'Interim Doha deal proves elusive'|||'Seoul cajoles N Korea to return to talks'|||'Net neutrality suffers another defeat in Senate'",
 "'Summers named University professor'|||'Japan\\xe2\\x80\\x99s premier follows dream to Memphis'|||'Factors stall Bush in aim to close Guant\\xc3\\xa1namo'|||'Boeing in final deal over procurement scandals'|||'Mayor\\xe2\\x80\\x99s clout brings King papers to Atlanta'|||'Justices prove to be no lapdogs of Bush'|||'Lockerbie victims press US to keep squeeze on Libya'|||'Markets rally after statement from Fed'",
 'No News Today

Looks good. Now we can move on to collecting and cleaning the price data.

#### 5. Collect the price data from Yahoo Finance.

In [14]:
DIA = Share('DIA')
DIA_history = DIA.get_historical(start_date = date_lb, end_date = date_ub)
DIA_df = pd.DataFrame([[x['Date'], x['Close']] for x in DIA_history[0:]])
DIA_df1 = DIA_df.copy()
DIA_df1.columns = ['Date', 'Price']

In [15]:
DIA_df2 = DIA_df1.copy()
DIA_df2.Date = DIA_df2.Date.apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))

In [16]:
print DIA_df2.head()
print len(DIA_df2)
print DIA_df2.loc[0, 'Date']
print DIA_df2.loc[250, 'Date']
print DIA_df2.tail()

        Date       Price
0 2017-02-10  202.740005
1 2017-02-09  201.720001
2 2017-02-08  200.509995
3 2017-02-07  200.580002
4 2017-02-06  200.279999
2675
2017-02-10 00:00:00
2016-02-16 00:00:00
           Date       Price
2670 2006-07-05  111.470001
2671 2006-07-03  112.199997
2672 2006-06-30  111.790001
2673 2006-06-29  111.800003
2674 2006-06-28      109.82


#### 6. Clean the price data for our future analysis.

In [17]:
merged_price = pd.merge(pd.DataFrame({'Date': date_range_df}), DIA_df2, on = 'Date', how = 'left')
merged_price.head()

| | Date | Price |
|---|---|---|
| 0 | 2006-06-28 | 109.82 |
| 1 | 2006-06-29 | 111.800003 |
| 2 | 2006-06-30 | 111.790001 |
| 3 | 2006-07-01 | NaN |
| 4 | 2006-07-02 | NaN |
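
The effect of `how = 'left'` above can be pictured with plain Python: every calendar day is kept, and days without a quote get a missing value. This is a rough sketch of the idea, not of how pandas implements the merge; the three prices are copied from the output above.

```python
# Full calendar on the left, sparse price data on the right.
calendar = ['2006-06-28', '2006-06-29', '2006-06-30', '2006-07-01', '2006-07-02']
closes = {'2006-06-28': 109.82, '2006-06-29': 111.800003, '2006-06-30': 111.790001}

# dict.get returns None for days with no quote (weekends, holidays),
# playing the role of NaN in the merged data frame.
merged = [(day, closes.get(day)) for day in calendar]
```

An inner join would instead drop 2006-07-01 and 2006-07-02, and the price series would no longer line up one-to-one with the daily news list.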


Because prices exist only for days when the market is open, the data frame contains 'NaN' for weekends and holidays. For this analysis, we fill those cells by linear interpolation.

In [18]:
merged_price1 = merged_price.copy()
merged_price1.Price = merged_price1.Price.astype(float).interpolate(method = 'linear')
merged_price1.head()

| | Date | Price |
|---|---|---|
| 0 | 2006-06-28 | 109.82 |
| 1 | 2006-06-29 | 111.800003 |
| 2 | 2006-06-30 | 111.790001 |
| 3 | 2006-07-01 | 111.926666 |
| 4 | 2006-07-02 | 112.063332 |


In [19]:
merged_price2 = merged_price1.set_index('Date')
merged_price2.index

DatetimeIndex(['2006-06-28', '2006-06-29', '2006-06-30', '2006-07-01',
               '2006-07-02', '2006-07-03', '2006-07-04', '2006-07-05',
               '2006-07-06', '2006-07-07',
               ...
               '2017-02-01', '2017-02-02', '2017-02-03', '2017-02-04',
               '2017-02-05', '2017-02-06', '2017-02-07', '2017-02-08',
               '2017-02-09', '2017-02-10'],
              dtype='datetime64[ns]', name=u'Date', length=3881, freq=None)

The 'freq=None' in the output above causes problems for the plot later on, so we rebuild the data frame on an index with "freq='D'".

In [20]:
merged_price3 = pd.DataFrame(data = {'Price': merged_price2.Price}, 
                    index = pd.date_range(start = date_lb, end = date_ub, freq = 'D'))
print merged_price3.index
merged_price3.head()

DatetimeIndex(['2006-06-28', '2006-06-29', '2006-06-30', '2006-07-01',
               '2006-07-02', '2006-07-03', '2006-07-04', '2006-07-05',
               '2006-07-06', '2006-07-07',
               ...
               '2017-02-01', '2017-02-02', '2017-02-03', '2017-02-04',
               '2017-02-05', '2017-02-06', '2017-02-07', '2017-02-08',
               '2017-02-09', '2017-02-10'],
              dtype='datetime64[ns]', length=3881, freq='D')


| | Price |
|---|---|
| 2006-06-28 | 109.82 |
| 2006-06-29 | 111.800003 |
| 2006-06-30 | 111.790001 |
| 2006-07-01 | 111.926666 |
| 2006-07-02 | 112.063332 |


Now the output looks good and we can start to make our plot.

#### 7. Make an interactive plot using Bokeh.

In [21]:
vals_list_of_list = merged_price3.values.T.tolist()

In [22]:
ts_list_of_list = []
for i in range(0,len(merged_price3.columns)):
    ts_list_of_list.append(merged_price3.index)

Define the tools to be shown in the plot.

In [23]:
_tools_to_show = 'box_zoom,pan,save,hover,resize,reset,tap,wheel_zoom' 

Create the plot.

In [24]:
p = figure(width=1200, height=900, x_axis_type="datetime", tools=_tools_to_show)

In [25]:
p.multi_line(ts_list_of_list, vals_list_of_list, line_color=['#3399cc'])

In [26]:
result_lis = []
for (name, series) in merged_price3.iteritems():
    result_lis.append((name, series))

In [27]:
name_for_display = np.tile('DIA', [len(merged_price3.index),1])

In [28]:
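# Take the prices straight from the data frame rather than from a variable left over from a loop
source = ColumnDataSource({'x': merged_price3.index, 'y': merged_price3.Price.values, 'series_name': name_for_display,
                           'News': news_lis, 'fonts': np.tile('<b>bold</b>', [len(merged_price3.index),1])})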
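
The hover tooltips rely on row i of `News` describing the same day as row i of `x` and `y`, so every column passed to the data source should have the same length and ordering. A small sanity check along these lines could be run before building the source. `check_equal_lengths` is a hypothetical helper, not a Bokeh function, and the toy columns below stand in for the real ones.

```python
def check_equal_lengths(columns):
    """Raise ValueError if the named columns do not all have the same length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('column lengths differ: %r' % (lengths,))
    return lengths

# Toy stand-ins for the index, the prices and the news list.
lengths = check_equal_lengths({
    'x': ['2006-06-28', '2006-06-29'],
    'y': [109.82, 111.800003],
    'News': ['first day titles', 'second day titles'],
})
```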

In [29]:
p.scatter('x', 'y', source = source, fill_alpha=0, line_alpha=0.3, line_color="#cad3c5", size = 10)

In [30]:
hover = p.select(dict(type=HoverTool))

In [31]:
hover.tooltips = [("Series", "@series_name"), ("News", "@News"), ("Value", "@y{0.00}")]

In [32]:
hover.mode = 'mouse'

In [39]:
show(p)

Bokeh is great and lets Python users make many types of interactive graphs. However, it seems to me that Highcharts, a JavaScript library, can make better and more interactive graphs. I wasn't able to figure out how to use JavaScript for this project within a week. Future commits and thoughts are welcome.