# HIGH FREQUENCY WORDS USING NLTK
## Required Packages

In [2]:
import nltk
import pandas as pd
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords

## Corpus

In [3]:

with open('Datasets/finalproj_corpus.txt', 'r', encoding='utf-8') as f:
    for line in f:
        print(line[0:1000])


US News – Top national stories and latest headlines - CNN Open Menu US Crime + Justice Energy + Environment More Extreme Weather Space + Science Audio Search CNN Crime + Justice Energy + Environment Extreme Weather Space + Science Search Audio Edition US International Arabic Español Edition US International Arabic Español US Crime + Justice Energy + Environment Extreme Weather Space + Science World Africa Americas Asia Australia China Europe India Middle East United Kingdom Politics The Biden Presidency Facts First US Elections Business Markets Tech Media Success Perspectives Videos Opinion Political Op-Eds Social Commentary Health Life, But Better Fitness Food Sleep Mindfulness Relationships Entertainment Stars Screen Binge Culture Media Tech Innovate Gadget Foreseeable Future Mission: Ahead Upstarts Work Transformed Innovative Cities Style Arts Design Fashion Architecture Luxury Beauty Video Travel Destinations Food and Drink Stay News Videos Sports Pro Football College Football Bask

## Data Preprocessing
### Cleaning

In [4]:
import re
# def remove(x):
#     pattern = "['\n',@\'?\.$%_0-9]"
#     line = [re.sub(pattern, '', line) for i in line]
#     return x

In [17]:

# Strip digits and selected punctuation, then lowercase.
# Spaces are kept so the text can later be split on double spaces.
line = re.sub(r"[\n',@?.$%_0-9]", '', line).lower()
print(line)


      quotes displayed in real-time or delayed by at least  minutes market data provided by factset  powered and implemented by factset digital solutions  legal statement  mutual fund and etf data provided by refinitiv lipper  facebook twitter instagram rss email opinion fox news digital opinion  hours ago the nsba addressed joe biden in a letter expressing american public schools are under an immediate threat due to mask mandates crt gender ideology and more ap opinion  hours ago the medicare for all act of  which i have just introduced with  co-sponsors would provide comprehensive health care coverage to every man woman and child in our country reuters/scott audette/file photo opinion  hours ago big business is no friend to conservatives—that’s been clear for years and it’s increasingly no friend to america  hours ago now more than ever police should know that american politicians have their backs just as they have ours  hours ago the medicare for all act of  which i have just introd
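To see what this kind of character-class cleaning does on a small input, here is a minimal sketch. The character class is an assumption chosen to match the output above (digits and some punctuation removed; hyphens, slashes, and spaces kept). The names `clean` and `sample` are illustrative only.

```python
import re

# Assumed character class: newline, apostrophe, comma, @, ?, ., $, %, _ and digits
PATTERN = re.compile(r"[\n',@?.$%_0-9]")

def clean(text):
    """Remove the characters in PATTERN and lowercase the result."""
    return PATTERN.sub("", text).lower()

sample = "Quotes delayed by at least 15 minutes. Market data provided by FactSet."
cleaned = clean(sample)
print(cleaned)
```

Removing "15" leaves two adjacent spaces behind, which is why splitting on a double space separates the text at those points.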

In [18]:
split_text = line.split('  ')
split_text[0:20]

['',
 '',
 '',
 'quotes displayed in real-time or delayed by at least',
 'minutes market data provided by factset',
 'powered and implemented by factset digital solutions',
 'legal statement',
 'mutual fund and etf data provided by refinitiv lipper',
 'facebook twitter instagram rss email opinion fox news digital opinion',
 'hours ago the nsba addressed joe biden in a letter expressing american public schools are under an immediate threat due to mask mandates crt gender ideology and more ap opinion',
 'hours ago the medicare for all act of',
 'which i have just introduced with',
 'co-sponsors would provide comprehensive health care coverage to every man woman and child in our country reuters/scott audette/file photo opinion',
 'hours ago big business is no friend to conservatives—that’s been clear for years and it’s increasingly no friend to america',
 'hours ago now more than ever police should know that american politicians have their backs just as they have ours',
 'hours ago the me

In [19]:
# Build a one-column DataFrame: one row per double-space-separated segment
data = line
df = pd.DataFrame(data.split('  '), columns=['news_headlines'])
print(df.head())

                                      news_headlines
0                                                   
1                                                   
2                                                   
3  quotes displayed in real-time or delayed by at...
4            minutes market data provided by factset


This dataset contains only one column. The next task is to add labels: four new columns (neg, neu, pos, and compound) holding the sentiment scores computed from the text column. First, drop the three empty rows at the top of the DataFrame:

In [21]:
df_final = df.drop([0,1,2])
df_final.head()

Unnamed: 0,news_headlines
3,quotes displayed in real-time or delayed by at...
4,minutes market data provided by factset
5,powered and implemented by factset digital sol...
6,legal statement
7,mutual fund and etf data provided by refinitiv...
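Dropping rows 0 to 2 by index works for this corpus, but it relies on knowing where the empty rows are. A minimal sketch of an alternative, assuming the goal is simply to discard empty segments wherever they occur; `split_segments` is a hypothetical helper, not part of the original notebook.

```python
def split_segments(text, sep="  "):
    """Split text on sep and keep only the non-empty, stripped segments."""
    return [seg.strip() for seg in text.split(sep) if seg.strip()]

# Leading spaces produce empty segments, as in the corpus above
sample = "    quotes displayed in real-time  minutes market data  legal statement"
segments = split_segments(sample)
print(segments)
```

The resulting list could be passed straight to `pd.DataFrame(segments, columns=['news_headlines'])`, so no empty rows ever reach the DataFrame.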


### Tokenize

In [None]:
# # import TweetTokenizer() method from nltk
# from nltk.tokenize import TweetTokenizer
#
# # Create a reference variable for Class TweetTokenizer
# tk = TweetTokenizer()
# # Use tokenize method
# news_tokens = tk.tokenize(line)
#
# print(news_tokens[0:20])

### Add Labels

To label unlabeled data for sentiment analysis, we can use VADER, a lexicon- and rule-based sentiment model available through the NLTK library in Python. Let’s import the analyzer and download the lexicon it needs:

In [23]:
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\maria\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


NLTK's VADER analyzer scores a text in three categories: negative (`neg`), neutral (`neu`), and positive (`pos`). Each score is the proportion of the text that falls in that category.

It also computes a `compound` score for each text, a single summary value that combines the valence of the individual words.

The compound score ranges from -1 (most extreme negative) to +1 (most extreme positive).
Normalizing it to this fixed range makes scores easier to compare and to use in later analysis.
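The normalization can be sketched as follows. This assumes the form used in VADER's reference implementation, where a summed valence x is mapped to x / sqrt(x² + α) with α = 15. The function name `normalize_compound` is hypothetical and is not an NLTK API.

```python
import math

def normalize_compound(valence_sum, alpha=15.0):
    """Squash an unbounded valence sum into the range [-1, 1]."""
    score = valence_sum / math.sqrt(valence_sum * valence_sum + alpha)
    # Clamp defensively so the result never leaves [-1, 1]
    return max(-1.0, min(1.0, score))

for total in (-4.0, 0.0, 0.5, 4.0):
    print(total, round(normalize_compound(total), 4))
```

Under this assumption, a summed valence of 0.5 maps to about 0.128, which is consistent with the compound score shown for "legal statement" in the output below.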

In [24]:
# Score each headline once, then expand the score dicts into columns
scores = pd.DataFrame(
    df_final['news_headlines'].apply(sia.polarity_scores).tolist(),
    index=df_final.index,
)
df_final = df_final.join(scores[['neg', 'neu', 'pos', 'compound']])

In [28]:
df_final.head()

Unnamed: 0,news_headlines,neg,neu,pos,compound
3,quotes displayed in real-time or delayed by at...,0.192,0.808,0.0,-0.2263
4,minutes market data provided by factset,0.0,1.0,0.0,0.0
5,powered and implemented by factset digital sol...,0.0,0.779,0.221,0.1779
6,legal statement,0.0,0.4,0.6,0.128
7,mutual fund and etf data provided by refinitiv...,0.0,1.0,0.0,0.0
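The four columns above are numeric. If a categorical label is needed, a common convention is to threshold the compound score. A minimal sketch, assuming the ±0.05 cut-offs often used with VADER; both the threshold and the helper `label_from_compound` are assumptions, not part of the original notebook.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score to 'positive', 'negative', or 'neutral'."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound scores taken from the first rows of the table above
for value in (-0.2263, 0.0, 0.1779, 0.128):
    print(value, label_from_compound(value))
```

Applied to the DataFrame, this would look like `df_final['label'] = df_final['compound'].apply(label_from_compound)`.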


In [27]:
df_final.to_csv("Datasets/headlines_preprocessed.csv")