In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import datetime as dt # date and time
import matplotlib.pyplot as plt # plotting tool
import seaborn as sns # plotting tool
import random  #random values generator
from scipy.stats import ttest_ind ## statistical library
# Connect to the competition dataset. The kaggle.competitions module is only
# available inside Kaggle notebooks.

from kaggle.competitions import twosigmanews
env = twosigmanews.make_env()
(market_train_df, news_train_df) = env.get_training_data()
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

## The global figure size for all plots in this report is set in the next cell





In [None]:
## Getting rid of the noisy warnings and side information when running plots 
import warnings
warnings.filterwarnings('ignore')
## setting the plot figure size
plt.rcParams['figure.figsize'] = (20.0, 10.0)


**ABOUT THE DATA**

The data comes from the Kaggle competition "Two Sigma: Using News to Predict Stock Movements": https://www.kaggle.com/c/two-sigma-financial-news
        
1.  Market data (2007 to present), provided by Intrinio: contains financial market information such as opening price, closing price, trading volume, calculated returns, etc.
2.  News data (2007 to present), provided by Thomson Reuters: contains information about news articles/alerts published about assets, such as article details, sentiment, and other commentary.
     

    News data provided by Thomson Reuters. Copyright ©, Thomson Reuters, 2017. All Rights Reserved. Use, duplication, or sale of this service, or data contained herein, except as described in the Competition Rules, is strictly prohibited.

## Data Overview

In this capstone project, I will explore the data and try to answer three questions that provide some insight into it.
First, let's take a look at the data:


In [None]:
## Showing the shape (number of rows and columns) of our 1st dataset, the market dataset
print(f'{market_train_df.shape[0]} samples and {market_train_df.shape[1]} features in the training market dataset.')


In [None]:
## Showing the first 5 rows of our dataset
market_train_df.head()



In [None]:
## Showing the shape of our 2nd dataset, the news dataset
print(f'{news_train_df.shape[0]} samples and {news_train_df.shape[1]} features in the training news dataset.')

In [None]:
## Showing the last 5 rows of our dataset
news_train_df.tail()

Having an overview of the data helps in coming up with the exploratory questions necessary to get an idea of some of the trends or anomalies in the dataset.

The data is big, with many rows in both tables and a hefty number of columns (features). There are 4,072,956 samples and 16 features in the training market dataset, and 9,328,750 samples and 35 features in the training news dataset. However, each question will only involve one company at a time to minimize the data used.

## First Question

What is the distribution of the daily change in price of Apple stock over a period of time? What about a random company?

In [None]:
# Select the Apple stock rows (before 2017) from the original market dataset

asset1Code = 'AAPL.O'
asset1_df = market_train_df[(market_train_df['assetCode'] == asset1Code)
                            & (market_train_df['time'] < '2017-01-01')].copy()
# Select the rows of a randomly chosen stock from the original market dataset.
# random.randint is inclusive at both ends, so the upper bound is the last valid position.
asset2Code = market_train_df['assetCode'].iloc[random.randint(0, market_train_df.shape[0] - 1)]
asset2_df = market_train_df[(market_train_df['assetCode'] == asset2Code)
                            & (market_train_df['time'] < '2017-01-01')].copy()
# Select the Apple rows from the original news dataset
asset3Name = 'Apple Inc'
asset3_df = news_train_df.loc[lambda df: df['assetName'] == asset3Name, :].copy()


### Apple stock Distribution

In [None]:
## Plot a histogram with a KDE overlay of Apple's daily close-to-close returns
apple_returns = asset1_df.loc[:, 'returnsClosePrevRaw1']
sns.distplot(apple_returns, hist=True, kde=True,
             bins=36, color='darkblue',
             hist_kws={'edgecolor': 'black'},
             kde_kws={'linewidth': 2, 'color': 'k'})
# Mark the mean return with a dashed red line
plt.axvline(apple_returns.mean(),
            color='r',
            linestyle='dashed',
            linewidth=2)
plt.xlabel('Apple returns at close time')
plt.ylabel('Density')  # with kde=True the histogram is normalized to a density
plt.title('Apple stock returns distribution');

### The Random Stock Distribution

In [None]:
## Plot a histogram with a KDE overlay of the random stock's daily close-to-close returns
random_returns = asset2_df.loc[:, 'returnsClosePrevRaw1']
# Positional lookup: the label 0 may not exist in the index after filtering
random_name = asset2_df['assetName'].iloc[0]
sns.distplot(random_returns, hist=True, kde=True,
             bins=36, color='burlywood',
             hist_kws={'edgecolor': 'white'},
             kde_kws={'linewidth': 2, 'color': 'k'})
# Mark the mean return with a dashed red line
plt.axvline(random_returns.mean(),
            color='r',
            linestyle='dashed',
            linewidth=2)
plt.xlabel(f'{random_name} returns at close time')
plt.ylabel('Density')  # with kde=True the histogram is normalized to a density
plt.title(f'{random_name} stock returns distribution');

The histogram shows that Apple's returns at close time roughly follow a normal distribution. This is what I expected, but it is nice to see it. The mean is actually around 0 (does this mean stocks are a zero-sum game?).

For a randomly chosen company, the standard deviation, and therefore the width of the distribution, seems to differ from company to company.
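
The visual impression above can be backed by a few summary numbers per company. The following is a minimal sketch, not part of the original analysis: it uses a small synthetic table in place of `market_train_df` (the real data is only available on Kaggle), and the helper name `summarize_returns` and the asset codes `AAA.O` and `BBB.N` are made up for illustration.

```python
import numpy as np
import pandas as pd

def summarize_returns(df, code_col="assetCode", ret_col="returnsClosePrevRaw1"):
    """Per-asset count, mean, standard deviation and excess kurtosis of returns."""
    grouped = df.groupby(code_col)[ret_col]
    summary = grouped.agg(["count", "mean", "std"])
    # pandas' kurt() reports excess kurtosis, so a normal sample is close to 0
    summary["excess_kurtosis"] = grouped.apply(pd.Series.kurt)
    return summary

# Synthetic stand-in for the market data (assumption: two hypothetical assets)
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "assetCode": ["AAA.O"] * 1000 + ["BBB.N"] * 1000,
    "returnsClosePrevRaw1": np.concatenate([
        rng.normal(0.0, 0.01, 1000),  # low-volatility asset
        rng.normal(0.0, 0.03, 1000),  # high-volatility asset
    ]),
})

summary = summarize_returns(toy_df)
print(summary)
```

On the real data, the same helper could be applied to `market_train_df` directly. A mean near 0 with a different `std` per asset would match what the two histograms suggest, and an excess kurtosis far from 0 would indicate heavier tails than a normal distribution.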

## Second Question 
Is there a correlation between the level of negative sentiment in news articles and the fluctuations of Apple's stock?


To answer this question, we will use a scatter plot to see if there is a correlation between the daily negative sentiment level and Apple's stock returns projected over the next 10 days and adjusted for the market (the `returnsOpenNextMktres10` column).


In [None]:
## Add a date column (YYYY-MM-DD) to both datasets so they can later be merged by day

asset1_df['date'] = asset1_df['time'].dt.strftime(date_format='%Y-%m-%d')
asset3_df['date'] = asset3_df['time'].dt.strftime(date_format='%Y-%m-%d')

In [None]:
## Averaging the negative sentiment score of the Apple articles per day, so the tables can be merged on the date column

meanSent = pd.DataFrame(asset3_df.groupby('date')['sentimentNegative'].mean())

asset_merged = pd.merge(asset1_df, meanSent, on= 'date')

In [None]:
# Scatter plot with a fitted regression line to show whether there is a correlation
sns.regplot(x=asset_merged.loc[:, 'sentimentNegative'].values,
            y=asset_merged.loc[:, 'returnsOpenNextMktres10'].values,
            line_kws={'color': 'darkred'})
plt.xlabel('Negative Sentiment Level')
plt.ylabel('Apple Stock Returns Projected')
plt.title('Apple negative news sentiment vs. projected 10-day returns');

The resulting graph adds a linear regression line to the scatter plot, and it shows no significant correlation between the negative sentiment and the returns of the stock. Maybe this is just the case for Apple Inc., given that its stock does not react strongly to the news.
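
The scatter plot gives a visual answer. A correlation coefficient with its p-value puts a number on it. Below is a minimal sketch, not part of the original analysis, that uses `scipy.stats.pearsonr` on synthetic data. The helper name `correlation_report` and the generated arrays are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import pearsonr

def correlation_report(x, y, alpha=0.05):
    """Pearson r and p-value for two equally long samples, ignoring NaN pairs."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = ~(np.isnan(x) | np.isnan(y))  # drop pairs where either value is missing
    r, p_value = pearsonr(x[keep], y[keep])
    return {"r": float(r), "p_value": float(p_value),
            "significant": bool(p_value < alpha)}

# Synthetic stand-ins for the merged columns (assumption: 500 trading days)
rng = np.random.default_rng(1)
sentiment = rng.uniform(0.0, 1.0, 500)
unrelated_returns = rng.normal(0.0, 0.05, 500)                   # no link to sentiment
related_returns = -0.1 * sentiment + rng.normal(0.0, 0.05, 500)  # built-in negative link

unrelated = correlation_report(sentiment, unrelated_returns)
related = correlation_report(sentiment, related_returns)
print(unrelated)
print(related)
```

On the real data, the two inputs would be `asset_merged['sentimentNegative']` and `asset_merged['returnsOpenNextMktres10']`.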



## Third Question 
The previous question assessed the correlation between the sentiment of articles (the mean of the daily news sentiment value) and the returns. Now let's see if the mean return of Apple Inc. stock is statistically different depending on the sentiment of the articles (i.e. positive or negative).

In [None]:
## getting the positive and negative sentiment rows
asset_negative = asset3_df.loc[lambda df: df.loc[:,'sentimentClass'] == -1, :]

asset_positive =  asset3_df.loc[lambda df: df.loc[:,'sentimentClass'] == 1, :]
meanNeg = pd.DataFrame(asset_negative.groupby('date')['sentimentNegative'].mean())
meanPos = pd.DataFrame(asset_positive.groupby('date')['sentimentPositive'].mean())
asset_merged_neg = pd.merge(asset1_df, meanNeg, on= 'date')
asset_merged_pos = pd.merge(asset1_df, meanPos, on= 'date')

In [None]:
## Overlay the return distributions for negative-sentiment (red) and positive-sentiment (green) days
sns.distplot(asset_merged_neg.loc[:,'returnsOpenNextMktres10'], color="r", label='negative news')
sns.distplot(asset_merged_pos.loc[:,'returnsOpenNextMktres10'], color="g", label='positive news')

plt.xlabel('Returns values')
plt.ylabel('Density')
plt.legend()
plt.title('Distribution of 10-day market-adjusted returns by news sentiment');

In [None]:
## Applying a t-test on the returns of the positive-sentiment and negative-sentiment days
print(ttest_ind(asset_merged_pos.loc[:,'returnsOpenNextMktres10'], 
                asset_merged_neg.loc[:,'returnsOpenNextMktres10']))

The t-test resulted in the following values: 
T-statistic = 0.284
p-value = 0.77

 The usual significance level is: 
 $$\alpha = 0.05$$ 
Since our p-value is greater than the significance level of 0.05, we fail to reject the null hypothesis that the two means are equal. In other words, the test finds no statistically significant difference between the means of the two distributions. 
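
The decision rule above can be written out explicitly. This is a minimal sketch on synthetic samples, not the notebook's actual computation. It uses Welch's variant of the test (`equal_var=False`), which does not assume equal variances, whereas the cell above uses the default Student's t-test. The helper name `compare_means` and the generated samples are assumptions for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

def compare_means(sample_a, sample_b, alpha=0.05):
    """Welch's two-sample t-test with an explicit reject / fail-to-reject decision."""
    a = np.asarray(sample_a, dtype=float)
    b = np.asarray(sample_b, dtype=float)
    a, b = a[~np.isnan(a)], b[~np.isnan(b)]  # drop missing returns
    t_stat, p_value = ttest_ind(a, b, equal_var=False)
    return {"t": float(t_stat), "p_value": float(p_value),
            "reject_null": bool(p_value < alpha)}

# Synthetic stand-ins for returns on positive-news and negative-news days
rng = np.random.default_rng(3)
positive_days = rng.normal(0.00, 0.05, 400)
shifted_days = positive_days + 0.05  # same spread, clearly different mean

print(compare_means(positive_days, positive_days))  # identical samples: t = 0, p = 1
print(compare_means(positive_days, shifted_days))   # different means: tiny p-value
```

When `reject_null` is `False`, the data does not provide enough evidence that the means differ. That is a weaker statement than proving the means are equal.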

## Extra 
I have found a way to plot a word cloud that shows the words most associated with a certain feature. The figure below shows the top words in Apple headlines that were classified as having negative sentiment.

In [None]:
## This part is just a showcase of one of the ways to visualize our data.
from wordcloud import WordCloud, STOPWORDS 
stop = set(STOPWORDS)
text = ' '.join(asset_negative.loc[:,'headline'].str.lower().values)
wordcloud = WordCloud(max_font_size=None, stopwords=stop, background_color='white',
                      width=1200, height=1000).generate(text)
plt.figure(figsize=(12, 8))
plt.imshow(wordcloud)
plt.title('Top words in Apple headlines classified as negative')
plt.axis("off")
plt.show();

## Further Research
This is a huge dataset with many variables that may or may not have any significant correlation with our target prediction. However, taking into account the combined effect of multiple variables, I believe that supervised machine learning algorithms that handle many variables at once, such as neural networks, could be worth exploring. I would like to apply statistical analysis on the dataset using ANOVA or a similar technique to study multivariable predictions. I would also consider applying a form of factor analysis, such as Principal Component Analysis or Barra factors, which are very useful in financial analysis.
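
As a rough illustration of the Principal Component Analysis idea, the sketch below runs a PCA via NumPy's SVD on a synthetic returns matrix. It is only a sketch under stated assumptions and has not been applied to the competition data. The function name `pca_explained_variance`, the single shared "market" driver and the eight hypothetical assets are all made up for the example.

```python
import numpy as np

def pca_explained_variance(returns):
    """PCA via SVD. `returns` is a 2-D array with rows = days, columns = assets."""
    centered = returns - returns.mean(axis=0)
    # Rows of `components` are the principal axes (right singular vectors)
    _, singular_values, components = np.linalg.svd(centered, full_matrices=False)
    variance = singular_values ** 2 / (returns.shape[0] - 1)
    return components, variance / variance.sum()

# Synthetic returns: one shared driver plus asset-specific noise (assumption)
rng = np.random.default_rng(2)
market_factor = rng.normal(0.0, 0.02, size=(500, 1))   # common daily move
loadings = rng.uniform(0.5, 1.5, size=(1, 8))          # each asset's sensitivity to it
idiosyncratic = rng.normal(0.0, 0.005, size=(500, 8))  # asset-specific noise
toy_returns = market_factor @ loadings + idiosyncratic

components, explained = pca_explained_variance(toy_returns)
print(np.round(explained, 3))
```

In this toy setup the first component captures most of the variance because all assets share one driver. On real returns, the share explained by the first few components would show how much of the movement is common across stocks.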