# Time Series Final Project
## News Articles EDA
In this part, we obtain sentiment scores for news articles on technology topics. Specifically, we focus on articles that mention companies included in the MiraeAsset TIGER US Tech Top10 INDXX (https://www.bloomberg.com/quote/381170:KS#xj4y7vzkg).

Companies in the index:
* Microsoft Corp. (MSFT:US)
* Apple Inc. (AAPL:US)
* Alphabet Inc. (GOOGL:US)
* Amazon.com Inc. (AMZN:US)
* NVIDIA Corp. (NVDA:US)
* Meta Platforms Inc. (META:US)
* Tesla Inc. (TSLA:US)
* Broadcom Inc. (AVGO:US)

In [1]:
import pandas as pd

from transformers import pipeline
sentiment_analysis = pipeline("sentiment-analysis",model="siebert/sentiment-roberta-large-english")

Xformers is not installed correctly. If you want to use memorry_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


In [2]:
df_news = pd.read_parquet('https://storage.googleapis.com/msca-bdp-data-open/news_final_project/news_final_project.parquet', engine='pyarrow')

In [3]:
df_news.head()

Unnamed: 0,url,date,language,title,text
0,http://en.people.cn/n3/2021/0318/c90000-983012...,2021-03-18,en,Artificial intelligence improves parking effic...,\n\nArtificial intelligence improves parking e...
1,http://newsparliament.com/2020/02/27/children-...,2020-02-27,en,Children With Autism Saw Their Learning and So...,\nChildren With Autism Saw Their Learning and ...
2,http://www.dataweek.co.za/12835r,2021-03-26,en,"Forget ML, AI and Industry 4.0 – obsolescence ...","\n\nForget ML, AI and Industry 4.0 – obsolesce..."
3,http://www.homeoffice.consumerelectronicsnet.c...,2021-03-10,en,Strategy Analytics: 71% of Smartphones Sold Gl...,\n\nStrategy Analytics: 71% of Smartphones Sol...
4,http://www.itbusinessnet.com/2020/10/olympus-t...,2020-10-20,en,Olympus to Support Endoscopic AI Diagnosis Edu...,\n\nOlympus to Support Endoscopic AI Diagnosis...


Let's start with some EDA and clean the text column so that it can be fed into a sentiment analysis pipeline.

In [4]:
# Convert the 'date' column of df_news to pandas Timestamps (a no-op if it already is datetime)
df_news['date'] = pd.to_datetime(df_news['date'])

min_date = df_news['date'].min()
max_date = df_news['date'].max()

print("Minimum date:", min_date)
print("Maximum date:", max_date)

Minimum date: 2020-01-01 00:00:00
Maximum date: 2023-04-28 00:00:00


In [5]:
#pip install nltk

In [6]:
import nltk
from nltk.corpus import stopwords
import string

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/jupyter/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [7]:
# Build the stop-word set and the translation table once, not on every call
STOP_WORDS = set(stopwords.words('english'))
PUNCT_DIGITS_TABLE = str.maketrans('', '', string.punctuation + string.digits)

def preprocess_text(text):
    # Remove ASCII punctuation and digits
    text = text.translate(PUNCT_DIGITS_TABLE)

    # Convert text to lowercase
    text = text.lower()

    # Tokenize on whitespace
    tokens = text.split()

    # Remove stop words
    tokens = [word for word in tokens if word not in STOP_WORDS]

    # Join the tokens back into a single string
    return ' '.join(tokens)

df_processed = df_news.copy()
df_processed['text_processed'] = df_processed['text'].apply(preprocess_text)

In [8]:
df_processed.head()

Unnamed: 0,url,date,language,title,text,text_processed
0,http://en.people.cn/n3/2021/0318/c90000-983012...,2021-03-18,en,Artificial intelligence improves parking effic...,\n\nArtificial intelligence improves parking e...,artificial intelligence improves parking effic...
1,http://newsparliament.com/2020/02/27/children-...,2020-02-27,en,Children With Autism Saw Their Learning and So...,\nChildren With Autism Saw Their Learning and ...,children autism saw learning social skills boo...
2,http://www.dataweek.co.za/12835r,2021-03-26,en,"Forget ML, AI and Industry 4.0 – obsolescence ...","\n\nForget ML, AI and Industry 4.0 – obsolesce...",forget ml ai industry – obsolescence focus feb...
3,http://www.homeoffice.consumerelectronicsnet.c...,2021-03-10,en,Strategy Analytics: 71% of Smartphones Sold Gl...,\n\nStrategy Analytics: 71% of Smartphones Sol...,strategy analytics smartphones sold globally a...
4,http://www.itbusinessnet.com/2020/10/olympus-t...,2020-10-20,en,Olympus to Support Endoscopic AI Diagnosis Edu...,\n\nOlympus to Support Endoscopic AI Diagnosis...,olympus support endoscopic ai diagnosis educat...
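The preview above still contains an en dash ("industry – obsolescence"), because `string.punctuation` covers ASCII characters only. A minimal sketch of one way to also strip typographic quotes and dashes follows; the `EXTRA_PUNCT` constant and the `strip_punct_and_digits` helper are hypothetical names introduced here for illustration, not part of the notebook.

```python
import string

# Hypothetical extra characters to strip (assumption): curly quotes and the en dash,
# none of which are included in the ASCII-only string.punctuation
EXTRA_PUNCT = "’‘“”–"

# Translation table that deletes ASCII punctuation, digits, and the extra characters
_TABLE = str.maketrans('', '', string.punctuation + string.digits + EXTRA_PUNCT)

def strip_punct_and_digits(text: str) -> str:
    """Delete punctuation and digits, then collapse runs of whitespace."""
    return ' '.join(text.translate(_TABLE).split())

sample = "People’s Daily, March 18 – China’s capital"
print(strip_punct_and_digits(sample))
```

Whether to strip these characters at all is a judgment call: removing the apostrophe turns "China’s" into "Chinas", which later keyword matching would need to account for.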


Keep only the articles that mention the tech companies we are interested in.

In [9]:
tech_companies = ['microsoft', 'google', 'alphabet', 'meta', 'facebook', 'apple', 'amazon', 'nvidia', 'tesla', 'broadcom']

# Create a boolean mask to filter the rows based on the tech companies mentioned
mask = df_processed['text_processed'].str.contains('|'.join(tech_companies), case=False)

# Apply the mask; .copy() avoids a SettingWithCopyWarning when a column is added later
filtered_df = df_processed[mask].copy()
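Note that the pattern above matches substrings, so "meta" also matches "metadata" and "apple" matches "pineapple". One possible refinement, sketched here with the standard `re` module, is to anchor each name at word boundaries. The `company_pattern` and `mentions_company` names are hypothetical and are not used elsewhere in the notebook.

```python
import re

tech_companies = ['microsoft', 'google', 'alphabet', 'meta', 'facebook',
                  'apple', 'amazon', 'nvidia', 'tesla', 'broadcom']

# \b anchors each name at word boundaries, so 'meta' no longer matches 'metadata'
company_pattern = re.compile(
    r'\b(?:' + '|'.join(map(re.escape, tech_companies)) + r')\b',
    flags=re.IGNORECASE,
)

def mentions_company(text: str) -> bool:
    """Return True if any company name appears as a whole word in text."""
    return company_pattern.search(text) is not None

print(mentions_company("new metadata standard for pineapple farms"))  # False
print(mentions_company("meta unveils new headset"))                   # True
```

The same pattern string (`company_pattern.pattern`) could be passed to `Series.str.contains` in place of the plain `'|'.join(...)`. The trade-off is that possessives such as "apples" (from "Apple's" after punctuation removal) would no longer match, so strict boundaries may drop some relevant articles.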

In [10]:
filtered_df['text_processed'].iloc[0]

'artificial intelligence improves parking efficiency chinese cities peoples daily online home china politics foreign affairs opinions video china business military world society culture travel science sports photo languages chinese japanese french spanish russian arabic korean german portuguese thursday march home artificial intelligence improves parking efficiency chinese cities liu shiyao peoples daily march photo taken july shows sign electronic toll collection etc newly set roadside parking space yangzhuang road shijingshan district beijing urban areas city started use etc system roadside parking spaces since july people’s daily onlineli wenming thanks application artificial intelligence aiempowered roadside electronic toll collection etc system china’s capital city beijing seen significant improvement efficiency parking fee collection turnover roadside parking spots order roadside parking well traffic congestion city deepens roadside parking reform etc system almost covered roadsi

In [11]:
def get_sentiment_score(text):
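    # Chunk size in characters (not tokens); 512 characters typically stays
    # below the model's 512-token input limit
    max_length = 512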
    split_text = [text[i:i + max_length] for i in range(0, len(text), max_length)]
    scores = []

    for chunk in split_text:
        # Call your sentiment analysis model to obtain the sentiment result
        sentiment_result = sentiment_analysis(chunk)

        # Extract the sentiment label and score from the result
        sentiment_label = sentiment_result[0]['label']
        sentiment_score = sentiment_result[0]['score']
        
        numeric_score = 0.5
        # Map sentiment labels to numeric values
        if sentiment_label == 'POSITIVE':
            numeric_score = sentiment_score
        elif sentiment_label == 'NEGATIVE':
            numeric_score = 1 - sentiment_score

        scores.append(numeric_score)

    # Calculate weighted average of sentiment scores
    total_length = len(text)
    weighted_scores = [score * len(chunk) / total_length for score, chunk in zip(scores, split_text)]
    overall_score = sum(weighted_scores)

    return overall_score

# Example usage
text = "this company is so bad"
overall_score = get_sentiment_score(text)
print("Overall Sentiment Score:", overall_score)

Overall Sentiment Score: 0.000518500804901123
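The score is a length-weighted average of per-chunk positive probabilities, so values near 0 indicate negative text and values near 1 indicate positive text. The aggregation step can be illustrated in isolation, without loading the model. In this sketch the `results` list is made-up data standing in for pipeline output, and the helper names `to_positive_probability` and `length_weighted_score` are hypothetical.

```python
def to_positive_probability(label: str, score: float) -> float:
    """Map a (label, confidence) pair onto a 0-to-1 positive-probability scale."""
    if label == 'POSITIVE':
        return score
    if label == 'NEGATIVE':
        return 1 - score
    return 0.5  # unknown label: treat as neutral

def length_weighted_score(chunks, results):
    """Average the per-chunk scores, weighting each chunk by its character length."""
    total = sum(len(chunk) for chunk in chunks)
    if total == 0:
        return 0.5  # assumption: empty input is treated as neutral
    return sum(
        to_positive_probability(r['label'], r['score']) * len(chunk) / total
        for chunk, r in zip(chunks, results)
    )

# Made-up example: one full 512-character chunk and one shorter trailing chunk
chunks = ["a" * 512, "b" * 128]
results = [{'label': 'POSITIVE', 'score': 0.9},
           {'label': 'NEGATIVE', 'score': 0.8}]
print(length_weighted_score(chunks, results))  # approximately 0.76
```

Here the long positive chunk contributes 0.9 × 512/640 and the short negative chunk contributes 0.2 × 128/640. Returning 0.5 for empty input is a choice made for this sketch; `get_sentiment_score` as written returns 0 in that case.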


In [None]:
filtered_df['sentiment_score'] = filtered_df['text_processed'].apply(get_sentiment_score)

In [None]:
import pyarrow as pa
import pyarrow.parquet as pq
import gcsfs

# Convert the scored DataFrame to an Arrow table once and reuse it
table = pa.Table.from_pandas(filtered_df)

# Save a local copy in Parquet format
file_path = 'filtered_sentiment.parquet'
pq.write_table(table, file_path)

# Save the same table to the GCS bucket
bucket_name = 'msca-time-series-bucket'
gcs_path = f'gs://{bucket_name}/{file_path}'

fs = gcsfs.GCSFileSystem()
with fs.open(gcs_path, 'wb') as f:
    pq.write_table(table, f)

Installing Hugging Face Transformers

In [5]:
#pip install transformers

We will use this sentiment analysis transformer: https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest

In [None]:
import datetime
import pytz

datetime.datetime.now(pytz.timezone('US/Central')).strftime("%a, %d %B %Y %H:%M:%S")