
#### Installations

In [None]:
! /usr/bin/python3 -m pip install "pymongo[srv]"

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


#### Imports

In [None]:
import nltk
import re

import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem import PorterStemmer
import string

from nltk.corpus import stopwords

from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
# NLTK resources (download only once)
nltk.download('stopwords',quiet=True) # stopword library
nltk.download('wordnet', quiet=True) # wordnet library
nltk.download('words', quiet=True) # words library
nltk.download('punkt', quiet=True) # tokenize library

True

In [None]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [None]:
import pymongo
from pymongo import MongoClient

#### Create MongoDB client

In [None]:
client = MongoClient('mongodb+srv://random:Random@stock.mbex3cy.mongodb.net/?retryWrites=true&w=majority')
db = client["Stocks"]
collection = db["moneycontrol"]
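
Hard-coding the username and password in the connection string exposes them to anyone who can read the notebook. A minimal sketch of one alternative, assuming the URI is stored in an environment variable; the variable name `MONGODB_URI` and the helper `mongo_uri_from_env` are hypothetical and not part of the original notebook:

```python
import os

def mongo_uri_from_env(var_name="MONGODB_URI"):
    """Return the MongoDB connection string stored in an environment variable."""
    uri = os.environ.get(var_name)
    if not uri:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return uri

# Usage (requires pymongo and a reachable cluster):
# from pymongo import MongoClient
# client = MongoClient(mongo_uri_from_env())
```

The notebook can then be shared without the credentials embedded in it.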

#### Create Pandas dataframe from Mongo Data

In [None]:
columns = ["title", "desc", "author", "timestamp" , "content" , "source" , "link", "timestamp"]
df = pd.DataFrame(columns = columns)

In [None]:
row_num = 0 
for document in collection.find():
    row = [document.get(column) for column in columns]
    df.loc[row_num] = row
    row_num +=1
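
Appending one row at a time with `df.loc[row_num] = row` works, but it tends to get slow as the frame grows. A possible alternative is to collect all rows first and build the frame once. The sketch below uses a small hand-made list of dictionaries (`documents`) as a stand-in for `collection.find()`; the data and the name `df_demo` are made up for illustration:

```python
import pandas as pd

# Stand-in for collection.find(): dictionaries that may lack some keys.
documents = [
    {"title": "A", "desc": "first", "author": "PTI"},
    {"title": "B", "author": "Reuters"},  # no "desc" key
]
columns = ["title", "desc", "author"]

# Collect every row, using .get() so missing keys become None,
# then construct the DataFrame in a single call.
rows = [{column: doc.get(column) for column in columns} for doc in documents]
df_demo = pd.DataFrame(rows, columns=columns)
```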
  

## EDA

### DataFrame

In [None]:
df.shape

(694, 8)

In [None]:
df.head()

Unnamed: 0,title,desc,author,timestamp,content,source,link,timestamp.1
0,Surprise Tightening in Asia Ups Pressure on Do...,"This month, Thailand's baht is the worst-perfo...",Bloomberg,2022-07-16 19:53:00,Central banks in Asia that remained dovish eve...,business,https://www.moneycontrol.com/news/business/ban...,2022-07-16 19:53:00
1,Key events expected next week in India and aro...,Here are the key events to get you started for...,Moneycontrol News,2022-07-16 18:50:00,"Let's take a look at the important business, p...",business,https://www.moneycontrol.com/news/business/key...,2022-07-16 18:50:00
2,Done my best to lead NSE in difficult period: ...,"Limaye, whose five-year term ended on July 16,...",PTI,2022-07-16 18:24:00,NSE’s outgoing Managing Director and CEO Vikra...,business,https://www.moneycontrol.com/news/business/don...,2022-07-16 18:24:00
3,Moneycontrol Selects: Top stories this evening,Our specially curated package of the most inte...,Moneycontrol News,2022-07-16 20:09:00,Here are the top stories this evening: HDFC Ba...,business,https://www.moneycontrol.com/news/business/mon...,2022-07-16 20:09:00
4,"Coinbase ""temporarily shutting down"" US affili...","According to a Business Insider article, crypt...",Moneycontrol News,2022-07-16 14:49:00,"The report, quoting emails that were sent to t...",business,https://www.moneycontrol.com/news/business/cry...,2022-07-16 14:49:00


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 694 entries, 0 to 693
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   title      694 non-null    object        
 1   desc       616 non-null    object        
 2   author     611 non-null    object        
 3   timestamp  694 non-null    datetime64[ns]
 4   content    694 non-null    object        
 5   source     694 non-null    object        
 6   link       694 non-null    object        
 7   timestamp  694 non-null    datetime64[ns]
dtypes: datetime64[ns](2), object(6)
memory usage: 48.8+ KB


The desc column has 78 null values (694 - 616).

The author column has 83 null values (694 - 611).

The timestamp column appears twice, so one instance is dropped.

In [None]:
df = df.loc[:,~df.columns.duplicated()].copy()
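
The null counts and the duplicate-column mask can also be computed directly rather than read off `df.info()`. A minimal sketch on a toy frame (the `demo` data is invented; only the pandas calls mirror the cell above):

```python
import pandas as pd

# Toy frame with a repeated column name, mimicking the duplicated "timestamp".
demo = pd.DataFrame(
    [["a", None, 1, 1], ["b", "x", 2, 2]],
    columns=["title", "desc", "timestamp", "timestamp"],
)

null_counts = demo.isna().sum()        # number of nulls per column
dup_mask = demo.columns.duplicated()   # True for every repeat after the first occurrence
deduped = demo.loc[:, ~dup_mask].copy()
```

`columns.duplicated()` keeps the first occurrence of each name, so the second `timestamp` is the one removed.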

### title Column

In [None]:
# Number of unique titles
df.title.nunique()

661

There are 694 rows but only 661 distinct titles, so 33 rows repeat a title that already appears. This may indicate duplicated data.

In [None]:
df.title.is_unique

False

In [None]:
df.duplicated().any()

True

This means at least one row is an exact duplicate of another row across all columns.

In [None]:
df.drop_duplicates(inplace=True)

In [None]:
df.shape

(669, 7)

Only 25 rows were exact duplicates (694 - 669 = 25). Some of the remaining rows may still share a title.

In [None]:
df.title.value_counts()

Moneycontrol Selects: Top stories this evening                                 4
Agri Picks Report: Geojit                                                      3
Stock Market Today: Top 10 things to know before the market opens today        3
Trade setup for today: Top 15 things to know before the opening bell           2
Surprise Tightening in Asia Ups Pressure on Dovish Central Banks               1
                                                                              ..
Saudi Arabia expected to grant access to Israeli air travel: US official       1
Rishi Sunak tops second round of voting in race to become UK Prime Minister    1
Kansai Nerolac Q1 PAT seen up 19.7% YoY to Rs. 142 cr: ICICI Direct            1
Transport Corp Q1 PAT seen up 25.9% YoY to Rs. 60.9 cr: ICICI Direct           1
Buy Avenue Supermarts; target of Rs 4971: YES Securities                       1
Name: title, Length: 661, dtype: int64

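Four titles appear more than once (4 + 3 + 3 + 2 = 12 rows), which accounts for the 8 rows beyond the 661 unique titles (669 - 661). They repeat because each belongs to a recurring series of articles published under the same title.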
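
The difference between exact-duplicate rows and rows that merely share a title can be checked directly. A minimal sketch on a toy frame (the `demo` data and variable names are made up for illustration):

```python
import pandas as pd

demo = pd.DataFrame({
    "title": ["Top stories", "Top stories", "Top stories", "Other"],
    "link":  ["u1", "u1", "u2", "u3"],
})

# Rows identical in every column (beyond their first occurrence).
full_row_dups = int(demo.duplicated().sum())

# After dropping those, count rows whose title was already seen.
deduped = demo.drop_duplicates()
title_dups = int(deduped.duplicated(subset="title").sum())

# Titles that still occur more than once.
repeated_titles = deduped["title"].value_counts().loc[lambda s: s > 1]
```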

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 669 entries, 0 to 693
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   title      669 non-null    object        
 1   desc       592 non-null    object        
 2   author     586 non-null    object        
 3   timestamp  669 non-null    datetime64[ns]
 4   content    669 non-null    object        
 5   source     669 non-null    object        
 6   link       669 non-null    object        
dtypes: datetime64[ns](1), object(6)
memory usage: 41.8+ KB


After removal of duplicate rows:

77 null values exist in the desc column.

83 null values exist in the author column.

In [None]:
df.head()

Unnamed: 0,title,desc,author,timestamp,content,source,link
0,Surprise Tightening in Asia Ups Pressure on Do...,"This month, Thailand's baht is the worst-perfo...",Bloomberg,2022-07-16 19:53:00,Central banks in Asia that remained dovish eve...,business,https://www.moneycontrol.com/news/business/ban...
1,Key events expected next week in India and aro...,Here are the key events to get you started for...,Moneycontrol News,2022-07-16 18:50:00,"Let's take a look at the important business, p...",business,https://www.moneycontrol.com/news/business/key...
2,Done my best to lead NSE in difficult period: ...,"Limaye, whose five-year term ended on July 16,...",PTI,2022-07-16 18:24:00,NSE’s outgoing Managing Director and CEO Vikra...,business,https://www.moneycontrol.com/news/business/don...
3,Moneycontrol Selects: Top stories this evening,Our specially curated package of the most inte...,Moneycontrol News,2022-07-16 20:09:00,Here are the top stories this evening: HDFC Ba...,business,https://www.moneycontrol.com/news/business/mon...
4,"Coinbase ""temporarily shutting down"" US affili...","According to a Business Insider article, crypt...",Moneycontrol News,2022-07-16 14:49:00,"The report, quoting emails that were sent to t...",business,https://www.moneycontrol.com/news/business/cry...


### desc column

In [None]:
df.desc.nunique()

584

In [None]:
df.desc.value_counts()[:5]

Our specially curated package of the most interesting articles to help you stay at the top of your game.                                                                                                                                                               4
Stocks to Watch: Check out the companies making headlines before the opening bell.                                                                                                                                                                                     3
Gotabaya Rajapaksa, the 73-year-old leader who had promised to resign on Wednesday, appointed Prime Minister Ranil Wickremesinghe as the acting President hours after he fled the country, escalating the political crisis and triggering a fresh wave of protests.    3
The duty cut brought down the petrol price in Delhi by Rs 9.5 and Rs 7 for diesel. Petrol in Delhi costs Rs 96.72 and diesel Rs 89.62 a litre.                                                               

Of the 592 non-null values in the desc column, 584 are unique, so 8 repeat a description that already appears.

### author column

In [None]:
# various authors of the articles
df.author.value_counts()

Moneycontrol News    115
PTI                  111
Reuters               95
Broker Research       86
Bloomberg             17
                    ... 
Maryam Farooqui        1
KT Jagannathan         1
Vatsala Kamat          1
Anjali Kochhar         1
Pravesh Gour           1
Name: author, Length: 88, dtype: int64

In [None]:
# top 10 authors by count
df.author.value_counts()[:10]

Moneycontrol News       115
PTI                     111
Reuters                  95
Broker Research          86
Bloomberg                17
Sunil Shankar Matkar     11
Sandip Das                8
Debangana Ghosh           5
AFP                       5
Mansi Verma               5
Name: author, dtype: int64

### timestamp column

In [None]:
# The duplicate timestamp column was dropped earlier; only one remains
from datetime import datetime

In [None]:
df['Date'] = pd.to_datetime(df['timestamp']).dt.date

In [None]:
df.head()

Unnamed: 0,title,desc,author,timestamp,content,source,link,Date
0,Surprise Tightening in Asia Ups Pressure on Do...,"This month, Thailand's baht is the worst-perfo...",Bloomberg,2022-07-16 19:53:00,Central banks in Asia that remained dovish eve...,business,https://www.moneycontrol.com/news/business/ban...,2022-07-16
1,Key events expected next week in India and aro...,Here are the key events to get you started for...,Moneycontrol News,2022-07-16 18:50:00,"Let's take a look at the important business, p...",business,https://www.moneycontrol.com/news/business/key...,2022-07-16
2,Done my best to lead NSE in difficult period: ...,"Limaye, whose five-year term ended on July 16,...",PTI,2022-07-16 18:24:00,NSE’s outgoing Managing Director and CEO Vikra...,business,https://www.moneycontrol.com/news/business/don...,2022-07-16
3,Moneycontrol Selects: Top stories this evening,Our specially curated package of the most inte...,Moneycontrol News,2022-07-16 20:09:00,Here are the top stories this evening: HDFC Ba...,business,https://www.moneycontrol.com/news/business/mon...,2022-07-16
4,"Coinbase ""temporarily shutting down"" US affili...","According to a Business Insider article, crypt...",Moneycontrol News,2022-07-16 14:49:00,"The report, quoting emails that were sent to t...",business,https://www.moneycontrol.com/news/business/cry...,2022-07-16


In [None]:
type(df['Date'][0])

datetime.date
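
Because `.dt.date` returns plain Python `datetime.date` objects, the new column has `object` dtype. If a datetime dtype is preferred for later grouping or plotting, `.dt.normalize()` is one option. A minimal sketch with three invented timestamps (the variable names are illustrative only):

```python
import pandas as pd

stamps = pd.Series(pd.to_datetime([
    "2022-07-16 19:53:00",
    "2022-07-16 18:50:00",
    "2022-07-15 09:00:00",
]))

as_objects = stamps.dt.date            # Python datetime.date values, object dtype
as_datetimes = stamps.dt.normalize()   # timestamps set to midnight, datetime dtype kept

# Articles per day, ordered chronologically rather than by count.
per_day = as_datetimes.value_counts().sort_index()
```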

In [None]:
df.Date.value_counts()

2022-07-13    240
2022-07-14    211
2022-07-15    180
2022-07-16     31
2022-07-12      7
Name: Date, dtype: int64

### content column

In [None]:
df['content'][0]

'Central banks in Asia that remained dovish even in the face of soaring inflation may see their resolve tested after a surprise tightening by peers in the region leaves their currencies vulnerable to sell-off, according to economists. Thailand, which has kept its key rate at a record low to bolster the economy’s recovery, is seeing the baht emerge as this month’s worst performer out of 12 Asian currencies tracked by Bloomberg. The Indonesian rupiah weakened for the sixth straight week amid foreign outflows driven by the nation’s widening monetary policy gap with the US. “Wobbly exchange rate, and an increasingly determined Fed are adding to the urgency for monetary tightening in many Asian markets,” said Frederic Neumann, chief Asia economist at HSBC Holdings Plc. “As interest rate hikes are delivered in quick succession elsewhere in the region, central banks in Thailand and Indonesia might now speed up their own responses.” he added. Central Banks Keep Surprising With Hikes as Inflati

This consists of the complete article text.

### source column

In [None]:
# Source column tells us that all are business related news
df.source.value_counts()

business    669
Name: source, dtype: int64

### link column

In [None]:
df.link.nunique()

669

All 669 links are unique, which means every remaining article was collected from a different URL.

In [None]:
df.head()

Unnamed: 0,title,desc,author,timestamp,content,source,link,Date
0,Surprise Tightening in Asia Ups Pressure on Do...,"This month, Thailand's baht is the worst-perfo...",Bloomberg,2022-07-16 19:53:00,Central banks in Asia that remained dovish eve...,business,https://www.moneycontrol.com/news/business/ban...,2022-07-16
1,Key events expected next week in India and aro...,Here are the key events to get you started for...,Moneycontrol News,2022-07-16 18:50:00,"Let's take a look at the important business, p...",business,https://www.moneycontrol.com/news/business/key...,2022-07-16
2,Done my best to lead NSE in difficult period: ...,"Limaye, whose five-year term ended on July 16,...",PTI,2022-07-16 18:24:00,NSE’s outgoing Managing Director and CEO Vikra...,business,https://www.moneycontrol.com/news/business/don...,2022-07-16
3,Moneycontrol Selects: Top stories this evening,Our specially curated package of the most inte...,Moneycontrol News,2022-07-16 20:09:00,Here are the top stories this evening: HDFC Ba...,business,https://www.moneycontrol.com/news/business/mon...,2022-07-16
4,"Coinbase ""temporarily shutting down"" US affili...","According to a Business Insider article, crypt...",Moneycontrol News,2022-07-16 14:49:00,"The report, quoting emails that were sent to t...",business,https://www.moneycontrol.com/news/business/cry...,2022-07-16


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 669 entries, 0 to 693
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   title      669 non-null    object        
 1   desc       592 non-null    object        
 2   author     586 non-null    object        
 3   timestamp  669 non-null    datetime64[ns]
 4   content    669 non-null    object        
 5   source     669 non-null    object        
 6   link       669 non-null    object        
 7   Date       669 non-null    object        
dtypes: datetime64[ns](1), object(7)
memory usage: 63.2+ KB


In [None]:
df.loc[:,'content']

0      Central banks in Asia that remained dovish eve...
1      Let's take a look at the important business, p...
2      NSE’s outgoing Managing Director and CEO Vikra...
3      Here are the top stories this evening: HDFC Ba...
4      The report, quoting emails that were sent to t...
                             ...                        
689    France's new foreign minister said on Tuesday ...
690    Oil prices fell in early Asian trading on Wedn...
691    The Supreme Court on July 12 put a proposal fo...
692    Electric commercial vehicle maker PMI Electro ...
693    D-Mart registered better than expected revenue...
Name: content, Length: 669, dtype: object

In [None]:
portfolio = ['HINDUNILVR', 'TATAMOTORS', 'LT', 'MARUTI', 'HDFC', 'BPCL', 'BHARTIARTL', 'CIPLA', 'BAJFINANCE', 'ULTRACEMCO', 'ITC', 'INFY', 'TATASTEEL']


In [None]:
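def lower_case(text):
    return text.lower()

def remove_punctuation(text):
    # Replaces every character that is not an ASCII letter
    # (punctuation, digits, non-ASCII symbols) with a space
    return re.sub('[^a-zA-Z]', ' ', str(text))

def normalize_document(text):
    text = remove_punctuation(text)
    text = lower_case(text)
    return text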
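
A short illustration of what this normalization does to a sentence. The function is restated in compact form so the snippet runs on its own, and the sample string is invented:

```python
import re

def normalize_document(text):
    # Same two steps as above: non-letters become spaces, then lower-case.
    return re.sub('[^a-zA-Z]', ' ', str(text)).lower()

sample = "Let's take a look: Q1 PAT up 19.7%"
normalized = normalize_document(sample)
tokens = normalized.split()
```

Note that digits are removed along with punctuation, so figures such as `19.7%` disappear, and apostrophes leave stray single-letter tokens such as `s`.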

In [None]:
df['content1'] = df['content'].apply(normalize_document)

In [None]:
df.head(5)

Unnamed: 0,title,desc,author,timestamp,content,source,link,Date,content1
0,Surprise Tightening in Asia Ups Pressure on Do...,"This month, Thailand's baht is the worst-perfo...",Bloomberg,2022-07-16 19:53:00,Central banks in Asia that remained dovish eve...,business,https://www.moneycontrol.com/news/business/ban...,2022-07-16,central banks in asia that remained dovish eve...
1,Key events expected next week in India and aro...,Here are the key events to get you started for...,Moneycontrol News,2022-07-16 18:50:00,"Let's take a look at the important business, p...",business,https://www.moneycontrol.com/news/business/key...,2022-07-16,let s take a look at the important business p...
2,Done my best to lead NSE in difficult period: ...,"Limaye, whose five-year term ended on July 16,...",PTI,2022-07-16 18:24:00,NSE’s outgoing Managing Director and CEO Vikra...,business,https://www.moneycontrol.com/news/business/don...,2022-07-16,nse s outgoing managing director and ceo vikra...
3,Moneycontrol Selects: Top stories this evening,Our specially curated package of the most inte...,Moneycontrol News,2022-07-16 20:09:00,Here are the top stories this evening: HDFC Ba...,business,https://www.moneycontrol.com/news/business/mon...,2022-07-16,here are the top stories this evening hdfc ba...
4,"Coinbase ""temporarily shutting down"" US affili...","According to a Business Insider article, crypt...",Moneycontrol News,2022-07-16 14:49:00,"The report, quoting emails that were sent to t...",business,https://www.moneycontrol.com/news/business/cry...,2022-07-16,the report quoting emails that were sent to t...


In [None]:
df.loc[:,['content','content1']].head()

Unnamed: 0,content,content1
0,Central banks in Asia that remained dovish eve...,central banks in asia that remained dovish eve...
1,"Let's take a look at the important business, p...",let s take a look at the important business p...
2,NSE’s outgoing Managing Director and CEO Vikra...,nse s outgoing managing director and ceo vikra...
3,Here are the top stories this evening: HDFC Ba...,here are the top stories this evening hdfc ba...
4,"The report, quoting emails that were sent to t...",the report quoting emails that were sent to t...


In [None]:
df['token_content1'] = df['content1'].apply(nltk.word_tokenize)

In [None]:
df.head()

Unnamed: 0,title,desc,author,timestamp,content,source,link,Date,content1,token_content1
0,Surprise Tightening in Asia Ups Pressure on Do...,"This month, Thailand's baht is the worst-perfo...",Bloomberg,2022-07-16 19:53:00,Central banks in Asia that remained dovish eve...,business,https://www.moneycontrol.com/news/business/ban...,2022-07-16,central banks in asia that remained dovish eve...,"[central, banks, in, asia, that, remained, dov..."
1,Key events expected next week in India and aro...,Here are the key events to get you started for...,Moneycontrol News,2022-07-16 18:50:00,"Let's take a look at the important business, p...",business,https://www.moneycontrol.com/news/business/key...,2022-07-16,let s take a look at the important business p...,"[let, s, take, a, look, at, the, important, bu..."
2,Done my best to lead NSE in difficult period: ...,"Limaye, whose five-year term ended on July 16,...",PTI,2022-07-16 18:24:00,NSE’s outgoing Managing Director and CEO Vikra...,business,https://www.moneycontrol.com/news/business/don...,2022-07-16,nse s outgoing managing director and ceo vikra...,"[nse, s, outgoing, managing, director, and, ce..."
3,Moneycontrol Selects: Top stories this evening,Our specially curated package of the most inte...,Moneycontrol News,2022-07-16 20:09:00,Here are the top stories this evening: HDFC Ba...,business,https://www.moneycontrol.com/news/business/mon...,2022-07-16,here are the top stories this evening hdfc ba...,"[here, are, the, top, stories, this, evening, ..."
4,"Coinbase ""temporarily shutting down"" US affili...","According to a Business Insider article, crypt...",Moneycontrol News,2022-07-16 14:49:00,"The report, quoting emails that were sent to t...",business,https://www.moneycontrol.com/news/business/cry...,2022-07-16,the report quoting emails that were sent to t...,"[the, report, quoting, emails, that, were, sen..."


In [None]:
stops = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def correct_text(tokens, lemma=True):
    """Drop English stopwords from a token list, optionally lemmatize, and return a string."""
    words = [word for word in tokens if word not in stops]
    if lemma:
        words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(words)

In [None]:
df['content2'] = df['token_content1'].apply(correct_text)

In [None]:
df['token_content2'] = df['content2'].apply(nltk.word_tokenize)

In [None]:
df['token_content2'].apply(len).sum()

119331

In [None]:
df.loc[:,['content','content1', 'content2']].head()

Unnamed: 0,content,content1,content2
0,Central banks in Asia that remained dovish eve...,central banks in asia that remained dovish eve...,central bank asia remained dovish even face so...
1,"Let's take a look at the important business, p...",let s take a look at the important business p...,let take look important business political eco...
2,NSE’s outgoing Managing Director and CEO Vikra...,nse s outgoing managing director and ceo vikra...,nse outgoing managing director ceo vikram lima...
3,Here are the top stories this evening: HDFC Ba...,here are the top stories this evening hdfc ba...,top story evening hdfc bank net profit almost ...
4,"The report, quoting emails that were sent to t...",the report quoting emails that were sent to t...,report quoting email sent three creator shared...


In [None]:
only_english = set(nltk.corpus.words.words())
def clean_alphaneumeric_text(text):
    
    sample = text
    sample = re.sub(r"\S*https?:\S*", '', sample) # links and URLs
    sample = re.sub(r'\[.*?\]', '', sample) # text between [square brackets]
    sample = re.sub(r'\(.*?\)', '', sample) # text between (parentheses)
    sample = re.sub('[%s]' % re.escape(string.punctuation), '', sample) # punctuation
    sample = re.sub(r'\w*\d\w*', '', sample) # digits with trailing or preceding text
    sample = re.sub(r'\n', ' ', sample) # newline character
    sample = re.sub(r'\\n', ' ', sample) # literal backslash-n sequence
    sample = re.sub("[''""...“”‘’…]", '', sample) # quotation marks and ellipses
    sample = re.sub(r'<[^>]+>', '', sample) # HTML tags (angle brackets are already stripped with the punctuation above)
    
    sample = ' '.join([w for w in nltk.wordpunct_tokenize(sample) if w.lower() in only_english or not w.isalpha()]) # keep English dictionary words; does not remove Indian-language text
    sample = ' '.join(list(filter(lambda ele: re.search(r"[a-zA-Z\s]+", ele) is not None, sample.split()))) # drop tokens with no Latin letters (languages other than English)
    
    sample = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002500-\U00002BEF"  # box drawing, geometric shapes, misc symbols, arrows
                               u"\U00002702-\U000027B0"  # dingbats
                               u"\U000024C2-\U0001F251"
                               u"\U0001f926-\U0001f937"
                               u"\U00010000-\U0010ffff"
                               u"\u2640-\u2642"
                               u"\u2600-\u2B55"
                               u"\u200d"
                               u"\u23cf"
                               u"\u23e9"
                               u"\u231a"
                               u"\ufe0f"  # variation selector-16
                               u"\u3030"
                               "]+", flags=re.UNICODE).sub(r'', sample) #emojis and symbols
    sample = sample.strip()
    sample = " ".join([x.strip() for x in sample.split()])
    
    return sample

In [None]:
df['content3'] = df['content2'].apply(clean_alphaneumeric_text)

In [None]:
df.loc[:,['content','content1', 'content2','content3']].head()

Unnamed: 0,content,content1,content2,content3
0,Central banks in Asia that remained dovish eve...,central banks in asia that remained dovish eve...,central bank asia remained dovish even face so...,central bank dovish even face soaring inflatio...
1,"Let's take a look at the important business, p...",let s take a look at the important business p...,let take look important business political eco...,let take look important business political eco...
2,NSE’s outgoing Managing Director and CEO Vikra...,nse s outgoing managing director and ceo vikra...,nse outgoing managing director ceo vikram lima...,outgoing director said done best lead exchange...
3,Here are the top stories this evening: HDFC Ba...,here are the top stories this evening hdfc ba...,top story evening hdfc bank net profit almost ...,top story evening bank net profit almost fifth...
4,"The report, quoting emails that were sent to t...",the report quoting emails that were sent to t...,report quoting email sent three creator shared...,report sent three creator medium outlet noted ...
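
Comparing content2 with content3 shows that the dictionary filter also drops tokens such as "asia", "nse" and "hdfc", which may matter if articles are later matched against company names. One possible adjustment is to keep an explicit whitelist alongside the dictionary. The sketch below is a simplified stand-in, not the notebook's function: `english_vocab`, `domain_terms` and `filter_tokens` are hypothetical names, and the tiny vocabulary replaces the NLTK word list so the snippet runs on its own:

```python
# Hypothetical stand-ins: a tiny dictionary and a keep-list of domain terms.
english_vocab = {"central", "bank", "dovish", "even", "face", "top", "story"}
domain_terms = {"asia", "nse", "hdfc"}  # assumption: terms worth preserving

def filter_tokens(text, vocab, keep=frozenset()):
    """Keep tokens that are in the dictionary or explicitly whitelisted."""
    return ' '.join(w for w in text.split() if w in vocab or w in keep)

text = "central bank asia remained dovish"
strict = filter_tokens(text, english_vocab)
lenient = filter_tokens(text, english_vocab, domain_terms)
```

Here `strict` mirrors the dictionary-only behaviour, while `lenient` retains the whitelisted term.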
