# Data Source and Frequency

Filter news headlines for the stocks Apple, Microsoft and Amazon.

This notebook is divided into the following parts:
1. Import libraries
2. Import data
3. Data filtering
4. Print news headline sentiments for 'aapl'

## Import libraries

In [1]:
# For data manipulation
import pandas as pd

## Import data

1. news_headline_sentiments.csv

The dataset is large, so it is read one chunk at a time rather than loaded all at once.

In [3]:
# Read and import news headline sentiments
chunks = pd.read_csv('news_headline_sentiments.csv', chunksize=10000)

# Create empty lists
aapl_data = []
amzn_data = []
msft_data = []
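As a minimal sketch of what `chunksize` does, the example below reads a small in-memory CSV in chunks. The CSV text and the `sentiment_class` values are made up for illustration; only the `news_headline` column name comes from this notebook.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the real CSV (hypothetical rows).
csv_text = "news_headline,sentiment_class\n" + "\n".join(
    f"headline {i},0" for i in range(25)
)

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of a single DataFrame.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=10)

# Each iteration yields at most `chunksize` rows.
chunk_sizes = [len(chunk) for chunk in reader]
print(chunk_sizes)  # [10, 10, 5]
```

Note that the iterator can only be consumed once; to loop over the file again, `pd.read_csv` has to be called again.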

## Data filtering
Loop over the dataset one chunk at a time.

The following steps are performed:

1. Drop rows with missing values
2. Convert the headline text to lowercase
3. Select headlines containing 'apple', 'amazon' or 'microsoft' (or their tickers) and append the matches to the corresponding list
4. Concatenate the collected data for each stock
5. Count the number of matching headlines per stock
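
The steps above can be sketched on a single toy chunk. The rows below are invented for illustration, and the regex alternation (`"aapl|apple"`) is one possible way to write the two `str.contains` conditions as a single call; it is not necessarily how the original data was produced.

```python
import pandas as pd

# Hypothetical chunk with one missing headline.
toy_chunk = pd.DataFrame({
    "news_headline": ["Apple unveils new Mac", None,
                      "AMZN shares rise", "Oil prices fall"],
    "sentiment_class": [1, 0, 1, -1],
})

# Steps 1 and 2: drop missing values, then lowercase the headlines.
toy_chunk = toy_chunk.dropna().copy()
toy_chunk["news_headline"] = toy_chunk["news_headline"].str.lower()

# Step 3: select rows whose headline mentions the company or its ticker.
toy_aapl = toy_chunk[toy_chunk["news_headline"].str.contains("aapl|apple")]
toy_amzn = toy_chunk[toy_chunk["news_headline"].str.contains("amzn|amazon")]

# Steps 4 and 5: concatenate the per-chunk results and count the rows.
toy_aapl_all = pd.concat([toy_aapl])
toy_frequency = {"aapl": len(toy_aapl_all), "amzn": len(toy_amzn)}
print(toy_frequency)  # {'aapl': 1, 'amzn': 1}
```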


In [4]:
for df in chunks:
    # Remove/Drop missing values
    df.dropna(inplace=True)
    # Convert the headlines to lowercase
    df.news_headline = df.news_headline.str.lower()

    # Selects the headlines with the word 'apple' or 'aapl'
    data = df.loc[(df.news_headline.str.contains('aapl')) |
                  (df.news_headline.str.contains('apple'))]
    # Append the data into the empty aapl_data created
    aapl_data.append(data)

    # Selects the headlines with the word 'amazon' or 'amzn'
    data = df.loc[(df.news_headline.str.contains('amzn')) |
                  (df.news_headline.str.contains('amazon'))]
    # Append the data into the empty amzn_data created
    amzn_data.append(data)

    # Selects the headlines with the word 'microsoft' or 'msft'
    data = df.loc[(df.news_headline.str.contains('msft')) |
                  (df.news_headline.str.contains('microsoft'))]
    # Append the data into the empty msft_data created
    msft_data.append(data)


# Concat data together
aapl_data = pd.concat(aapl_data)
amzn_data = pd.concat(amzn_data)
msft_data = pd.concat(msft_data)

# Calculate the frequency of the news headlines
frequency = {'aapl': len(aapl_data),
             'amzn': len(amzn_data),
             'msft': len(msft_data)
             }

# Display the number of matching headlines per stock
frequency

{'aapl': 2984, 'amzn': 1351, 'msft': 1792}
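
One thing to keep in mind is that `str.contains('apple')` matches substrings, so a headline about "pineapple" would also be counted. Whether the original dataset contains such cases is not known here; the sketch below only illustrates the difference on invented headlines, using a word-boundary regex as a possible stricter alternative.

```python
import pandas as pd

# Hypothetical, already-lowercased headlines.
headlines = pd.Series([
    "pineapple exports surge",
    "apple's new iphone goes on sale",
    "oil prices fall",
])

# Plain substring match, as in the loop above.
substring_hits = headlines.str.contains("aapl|apple")

# Whole-word match: \b marks a word boundary, (?:...) is a non-capturing group.
word_hits = headlines.str.contains(r"\b(?:aapl|apple)\b")

print(substring_hits.sum(), word_hits.sum())
```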

In [None]:
# Save the Apple headlines to a CSV file
aapl_data.to_csv('news_headline_sentiments_aapl.csv')
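
By default, `to_csv` also writes the DataFrame index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. This may be where the `Unnamed: 0` columns in the later outputs come from, although that is an assumption. A minimal sketch of the difference, using an in-memory buffer and an invented row:

```python
import io
import pandas as pd

df = pd.DataFrame({"news_headline": ["apple stake raised"],
                   "sentiment_class": [0]})

# Default behaviour: the index is written as an extra, unnamed column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with = list(pd.read_csv(with_index).columns)

# index=False leaves the index out of the file.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)     # ['Unnamed: 0', 'news_headline', 'sentiment_class']
print(cols_without)  # ['news_headline', 'sentiment_class']
```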

## Conclusion

In the upcoming notebooks, we will use the CSV file containing the news headlines and sentiment classes for Apple (aapl).

##### OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

## Now let's work with my datasets

In [1]:
import pandas as pd

In [None]:
# The aapl_news_headline data had around 80,000 rows.
# I removed rows with duplicate values in the news_headline column,
# then kept only the necessary columns: ['news_headline', 'time_stamp', 'URL'].
# The 'time_stamp' column was originally called 'start_time_stamp'.
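
The cleaning described in the comments above is not shown in the notebook. A minimal sketch of how it could be done follows; the rows and the extra `source_id` column are invented, and the exact calls the author used are unknown.

```python
import pandas as pd

# Hypothetical raw data with one duplicated headline.
raw = pd.DataFrame({
    "news_headline": ["apple stake raised", "apple stake raised",
                      "new apple tv expected"],
    "start_time_stamp": ["2014-01-27 23:47:47"] * 3,
    "URL": ["/a", "/b", "/c"],
    "source_id": [1303, 1303, 1298],
})

cleaned = (
    raw.drop_duplicates(subset="news_headline")             # keep first occurrence
       .rename(columns={"start_time_stamp": "time_stamp"})  # rename the timestamp column
       [["news_headline", "time_stamp", "URL"]]             # keep only the needed columns
       .reset_index(drop=True)
)
print(cleaned)
```

`drop_duplicates` keeps the first occurrence by default, so the URL retained for a repeated headline is the one that appears first.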

In [1]:
!pip install vaderSentiment

Collecting vaderSentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
Installing collected packages: vaderSentiment
Successfully installed vaderSentiment-3.3.2


In [6]:
# Quick experiment with vaderSentiment: add custom words to the lexicon
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
new_words = {
    "bear" : -2.0,
    "bull" : 2.0,
}
analyzer.lexicon.update(new_words)

analyzer.lexicon["bull"]


2.0

In [17]:
headline = "apple 'blamed' for car crash, family sues company"
score = analyzer.polarity_scores(headline)
score

{'neg': 0.492, 'neu': 0.508, 'pos': 0.0, 'compound': -0.7003}

In [8]:
# Now let's get down to business and calculate the sentiment class and sentiment score for our news headline data
def sentimenClassCalculation(negative, positive):
    """Return -1, 1 or 0 depending on whether the negative or positive score is larger."""
    if negative > positive:
        return -1
    elif positive > negative:
        return 1
    else:
        return 0
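
To make the rule concrete, the sketch below applies the function to the score dictionary printed for the car-crash headline earlier. It also shows a hypothetical alternative, `class_from_compound`, which is not part of this notebook: it uses the ±0.05 cut-offs on the `compound` score that the VADER README suggests as a typical convention.

```python
def sentimenClassCalculation(negative, positive):
    # Same rule as in the notebook: compare the 'neg' and 'pos' proportions.
    if negative > positive:
        return -1
    elif positive > negative:
        return 1
    return 0

def class_from_compound(compound, threshold=0.05):
    # Hypothetical alternative: classify on the compound score instead.
    if compound >= threshold:
        return 1
    if compound <= -threshold:
        return -1
    return 0

# Score dictionary copied from the output shown earlier in this notebook.
score = {'neg': 0.492, 'neu': 0.508, 'pos': 0.0, 'compound': -0.7003}

by_neg_pos = sentimenClassCalculation(score['neg'], score['pos'])
by_compound = class_from_compound(score['compound'])
print(by_neg_pos, by_compound)  # -1 -1
```

The two rules agree here, but they need not agree for every headline, since `compound` is computed differently from the `neg` and `pos` proportions.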


In [3]:
my_apple_dataset = pd.read_csv("aapl_news_headline.csv")

In [13]:
# Calculate sentiment_score
#my_apple_dataset["sentiment_score"] = my_apple_dataset["news_headline"].apply(lambda t: analyzer.polarity_scores(t)['compound'])
#my_apple_dataset.head()

Unnamed: 0,news_headline,time_stamp,URL,sentiment_score
0,carl icahn ups apple stake to $3 billion,2014-01-27 23:47:47,http://money.cnn.com/2014/01/22/technology/app...,0.0
1,the apple mac at 30: see the evolution of a…,2014-01-27 23:47:47,http://feeds.wired.com/c/35185/f/661457/s/3656...,0.0
2,apple engineers who brought mac to life get r...,2014-01-27 23:47:47,http://gadgets.ndtv.com/shortlink.aspx?article...,0.0
3,apple takes a fresh bite into china s market,2014-01-27 23:47:47,/id/101340364,0.3182
4,icahn takes another $500 million bite out of a...,2014-01-27 23:47:47,"/business/sns-rt-us-apple-icahn-20140122,0,515...",0.0


In [9]:
# Calculate sentiment_class

# my_apple_dataset["sentiment_class"] = 0
# for i in range(0, len(my_apple_dataset)):
    
#     score = analyzer.polarity_scores(my_apple_dataset["news_headline"][i])
#     my_apple_dataset["sentiment_class"][i] = sentimenClassCalculation(score['neg'], score['pos'])

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  my_apple_dataset["sentiment_class"][i] = sentimenClassCalculation(score['neg'], score['pos'])
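
The warning above comes from chained indexing (`df["col"][i] = ...`). One way to avoid it, sketched here under assumptions, is to compute the whole column with `apply`. To keep the example self-contained, `toy_scores` is a hypothetical stand-in for `analyzer.polarity_scores` and only counts a few hand-picked words.

```python
import pandas as pd

def sentimenClassCalculation(negative, positive):
    if negative > positive:
        return -1
    elif positive > negative:
        return 1
    return 0

def toy_scores(text):
    # Hypothetical stand-in for analyzer.polarity_scores(text).
    words = text.split()
    return {
        "neg": sum(w in {"sues", "crash"} for w in words),
        "pos": sum(w in {"fresh", "help"} for w in words),
    }

def classify(text):
    score = toy_scores(text)
    return sentimenClassCalculation(score["neg"], score["pos"])

toy = pd.DataFrame({"news_headline": [
    "family sues company after crash",
    "apple takes a fresh bite into china s market",
    "carl icahn ups apple stake to $3 billion",
]})

# Assign the whole column at once: no row-by-row chained indexing.
toy["sentiment_class"] = toy["news_headline"].apply(classify)
print(toy["sentiment_class"].tolist())  # [-1, 1, 0]
```

If a row-by-row loop is preferred, `toy.loc[i, "sentiment_class"] = ...` sets the value in a single indexing step and also avoids the warning.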


In [10]:
my_apple_dataset

Unnamed: 0,news_headline,time_stamp,URL,sentiment_score,sentiment_class
0,carl icahn ups apple stake to $3 billion,2014-01-27 23:47:47,http://money.cnn.com/2014/01/22/technology/app...,0.0000,0
1,the apple mac at 30: see the evolution of a…,2014-01-27 23:47:47,http://feeds.wired.com/c/35185/f/661457/s/3656...,0.0000,0
2,apple engineers who brought mac to life get r...,2014-01-27 23:47:47,http://gadgets.ndtv.com/shortlink.aspx?article...,0.0000,0
3,apple takes a fresh bite into china s market,2014-01-27 23:47:47,/id/101340364,0.3182,1
4,icahn takes another $500 million bite out of a...,2014-01-27 23:47:47,"/business/sns-rt-us-apple-icahn-20140122,0,515...",0.0000,0
...,...,...,...,...,...
42278,"apple 'blamed' for car crash, family sues company",2016-12-31 08:15:02,/tech-news/apple-blamed-for-car-crash-family-s...,-0.7003,-1
42279,family sues apple for losing their daughter in...,2016-12-31 08:45:01,http://www.news18.com/news/tech/family-sues-ap...,-0.6486,-1
42280,apple said to seek lower taxes to start manufa...,2016-12-31 11:15:01,http://profit.ndtv.com/news/gadgets/article-ap...,-0.2960,-1
42281,texas family blame apple's facetime in suit ov...,2016-12-31 14:45:01,/Technology/wireStory/texas-family-blame-apple...,-0.8225,-1


In [11]:
#my_apple_dataset.to_csv("aapl_news_headline.csv", index=False)

In [None]:
# NOW FROM HERE ONWARDS I'M JUST CHECKING ROWS THAT ARE IN MY DATASET BUT NOT IN QUANTRA DATASET

In [12]:

# my_apple_dataset = pd.read_csv("aapl_news_headline.csv")
# my_apple_dataset

Unnamed: 0,news_headline,time_stamp,URL,sentiment_score,sentiment_class
0,carl icahn ups apple stake to $3 billion,2014-01-27 23:47:47,http://money.cnn.com/2014/01/22/technology/app...,0.0000,0
1,the apple mac at 30: see the evolution of a…,2014-01-27 23:47:47,http://feeds.wired.com/c/35185/f/661457/s/3656...,0.0000,0
2,apple engineers who brought mac to life get r...,2014-01-27 23:47:47,http://gadgets.ndtv.com/shortlink.aspx?article...,0.0000,0
3,apple takes a fresh bite into china s market,2014-01-27 23:47:47,/id/101340364,0.3182,1
4,icahn takes another $500 million bite out of a...,2014-01-27 23:47:47,"/business/sns-rt-us-apple-icahn-20140122,0,515...",0.0000,0
...,...,...,...,...,...
42278,"apple 'blamed' for car crash, family sues company",2016-12-31 08:15:02,/tech-news/apple-blamed-for-car-crash-family-s...,-0.7003,-1
42279,family sues apple for losing their daughter in...,2016-12-31 08:45:01,http://www.news18.com/news/tech/family-sues-ap...,-0.6486,-1
42280,apple said to seek lower taxes to start manufa...,2016-12-31 11:15:01,http://profit.ndtv.com/news/gadgets/article-ap...,-0.2960,-1
42281,texas family blame apple's facetime in suit ov...,2016-12-31 14:45:01,/Technology/wireStory/texas-family-blame-apple...,-0.8225,-1


In [13]:
# quantra_apple_dataset = pd.read_csv("news_headline_sentiments_aapl.csv")
# quantra_apple_dataset

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,news_headline,time_stamp,URL,source_id,sentiment_class,sentiment_scores
0,29,2012757,apple will refund at least $32.5m in app case,2014-01-27 10:17:47-08:00,//www.suntimes.com/business/24974362-420/apple...,1303,0,0.0000
1,169,2012707,"icahn raises apple stake, now owns $3b in stock",2014-01-27 10:17:47-08:00,//www.suntimes.com/business/25119767-420/icahn...,1303,0,0.0000
2,616,2012614,apple: people spent $10b in its app store in 2013,2014-01-27 10:17:47-08:00,//www.suntimes.com/business/24816093-420/apple...,1303,0,0.0000
3,618,2007933,"apple s mac still influencing computing, 30 ye...",2014-01-27 10:17:47-08:00,http://www.nbcnews.com/business/apples-mac-sti...,1288,0,0.0000
4,726,2011678,new apple tv set-top box expected in first hal...,2014-01-27 10:17:47-08:00,/business/technology/la-fi-tn-new-apple-tv-fir...,1298,0,0.0000
...,...,...,...,...,...,...,...,...
15308,1999722,1309707,govt not in favour of used apple iphones being...,2016-05-29 18:15:01-07:00,http://realtime.rediff.com/news/business/Govt-...,1502,-1,-0.3412
15309,1999813,1309841,apple?s refurbished iphone plan: govt not in f...,2016-05-29 18:45:01-07:00,http://www.hindustantimes.com/business/discuss...,1243,-1,-0.4952
15310,1999827,1309814,commerce ministry turns down apple?s proposal ...,2016-05-29 18:45:01-07:00,http://economictimes.indiatimes.com/news/econo...,1234,-1,-0.1880
15311,1999878,1309859,nirmala sitharaman may help pave way for apple...,2016-05-29 18:45:01-07:00,http://realtime.rediff.com/news/business/Nirma...,1502,1,0.4019


In [14]:
# common_dataset = my_apple_dataset.merge(quantra_apple_dataset, on=["news_headline"])
# common_dataset

Unnamed: 0.2,news_headline,time_stamp_x,URL_x,sentiment_score,sentiment_class_x,Unnamed: 0,Unnamed: 0.1,time_stamp_y,URL_y,source_id,sentiment_class_y,sentiment_scores
0,"icahn raises apple stake, now owns $3b in stock",2014-01-27 23:47:47,//www.suntimes.com/business/25119767-420/icahn...,0.0000,0,169,2012707,2014-01-27 10:17:47-08:00,//www.suntimes.com/business/25119767-420/icahn...,1303,0,0.0000
1,"let me take you back to 2007, when apple co-fo...",2014-01-27 23:47:47,http://financialexpress.com/news/when-the-pen-...,0.0000,0,740,1988180,2014-01-27 10:17:47-08:00,http://financialexpress.com/news/when-the-pen-...,1229,0,0.0000
2,apple will refund at least $32.5m in app case,2014-01-27 23:47:47,//www.suntimes.com/business/24974362-420/apple...,0.0000,0,29,2012757,2014-01-27 10:17:47-08:00,//www.suntimes.com/business/24974362-420/apple...,1303,0,0.0000
3,glass house + stone (or something) = apple sto...,2014-01-27 23:47:47,http://abcnews.go.com/blogs/technology/2014/01...,-0.6249,-1,1238,2007201,2014-01-27 10:17:47-08:00,http://abcnews.go.com/blogs/technology/2014/01...,1278,-1,-0.6249
4,watch steve jobs demo the first apple mac back...,2014-01-27 23:47:47,http://ibnlive.in.com/news/watch-steve-jobs-de...,0.0000,0,951,2005327,2014-01-27 10:17:47-08:00,http://ibnlive.in.com/news/watch-steve-jobs-de...,1273,0,0.0000
...,...,...,...,...,...,...,...,...,...,...,...,...
15046,commerce minister says govt not in favour of a...,2016-05-30 06:45:01,/Companies/lDn53REhqvhqGoIxs9MgwJ/Apples-propo...,-0.3412,-1,1999624,1309809,2016-05-29 18:15:01-07:00,/Companies/lDn53REhqvhqGoIxs9MgwJ/Apples-propo...,1511,-1,-0.5292
15047,apple?s refurbished iphone plan: govt not in f...,2016-05-30 07:15:01,http://www.hindustantimes.com/business/discuss...,-0.3412,-1,1999813,1309841,2016-05-29 18:45:01-07:00,http://www.hindustantimes.com/business/discuss...,1243,-1,-0.4952
15048,commerce ministry turns down apple?s proposal ...,2016-05-30 07:15:01,http://economictimes.indiatimes.com/news/econo...,0.0000,0,1999827,1309814,2016-05-29 18:45:01-07:00,http://economictimes.indiatimes.com/news/econo...,1234,-1,-0.1880
15049,nirmala sitharaman may help pave way for apple...,2016-05-30 07:15:01,http://realtime.rediff.com/news/business/Nirma...,0.4019,1,1999878,1309859,2016-05-29 18:45:01-07:00,http://realtime.rediff.com/news/business/Nirma...,1502,1,0.4019


In [15]:
# Check which news_headline values are in my dataset but not in the Quantra dataset
# rows_that_are_in_my_dataset_but_not_in_quantra_dataset = my_apple_dataset[(~my_apple_dataset.news_headline.isin(common_dataset.news_headline))]
# rows_that_are_in_my_dataset_but_not_in_quantra_dataset = rows_that_are_in_my_dataset_but_not_in_quantra_dataset.reset_index(drop=True)
# rows_that_are_in_my_dataset_but_not_in_quantra_dataset

Unnamed: 0,news_headline,time_stamp,URL,sentiment_score,sentiment_class
0,carl icahn ups apple stake to $3 billion,2014-01-27 23:47:47,http://money.cnn.com/2014/01/22/technology/app...,0.0000,0
1,the apple mac at 30: see the evolution of a…,2014-01-27 23:47:47,http://feeds.wired.com/c/35185/f/661457/s/3656...,0.0000,0
2,apple engineers who brought mac to life get r...,2014-01-27 23:47:47,http://gadgets.ndtv.com/shortlink.aspx?article...,0.0000,0
3,apple takes a fresh bite into china s market,2014-01-27 23:47:47,/id/101340364,0.3182,1
4,icahn takes another $500 million bite out of a...,2014-01-27 23:47:47,"/business/sns-rt-us-apple-icahn-20140122,0,515...",0.0000,0
...,...,...,...,...,...
28920,"apple 'blamed' for car crash, family sues company",2016-12-31 08:15:02,/tech-news/apple-blamed-for-car-crash-family-s...,-0.7003,-1
28921,family sues apple for losing their daughter in...,2016-12-31 08:45:01,http://www.news18.com/news/tech/family-sues-ap...,-0.6486,-1
28922,apple said to seek lower taxes to start manufa...,2016-12-31 11:15:01,http://profit.ndtv.com/news/gadgets/article-ap...,-0.2960,-1
28923,texas family blame apple's facetime in suit ov...,2016-12-31 14:45:01,/Technology/wireStory/texas-family-blame-apple...,-0.8225,-1
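
The comparison above can be sketched on toy data. The headlines below are placeholders. Two points are illustrated: an inner merge on `news_headline` repeats a row when the headline occurs more than once on the other side, and the "only in my dataset" rows can be found either with `isin` or with a left merge and `indicator=True`.

```python
import pandas as pd

# Placeholder data; "b" appears twice in the second frame.
mine = pd.DataFrame({"news_headline": ["a", "b", "c"]})
quantra = pd.DataFrame({"news_headline": ["b", "b", "d"]})

# Inner merge: one output row per matching pair, so "b" appears twice.
common = mine.merge(quantra, on="news_headline")

# Option 1: filter with isin, as in the cell above.
only_mine = mine[~mine["news_headline"].isin(quantra["news_headline"])]
only_mine = only_mine.reset_index(drop=True)

# Option 2: left merge with an indicator column, after de-duplicating the right side.
flagged = mine.merge(quantra.drop_duplicates("news_headline"),
                     on="news_headline", how="left", indicator=True)
only_mine_alt = flagged.loc[flagged["_merge"] == "left_only", ["news_headline"]]
only_mine_alt = only_mine_alt.reset_index(drop=True)

print(len(common), only_mine["news_headline"].tolist())  # 2 ['a', 'c']
```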


In [16]:
#rows_that_are_in_my_dataset_but_not_in_quantra_dataset.to_csv("rows_that_are_in_my_dataset_but_not_in_quantra_dataset.csv", index=False)