# AWS Glue Studio Notebook
##### You are now running an AWS Glue Studio notebook. To start using your notebook, you need to start an AWS Glue Interactive Session.


#### Optional: Run this cell to see available notebook commands ("magics").


In [2]:
%help

Welcome to the Glue Interactive Sessions Kernel
For more information on available magic commands, please type %help in any new cell.

Please view our Getting Started page to access the most up-to-date information on the Interactive Sessions kernel: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html
Installed kernel version: 0.37.3 



# Available Magic Commands

## Sessions Magic

----
    %help                             Return a list of descriptions and input types for all magic commands. 
    %profile            String        Specify a profile in your aws configuration to use as the credentials provider.
    %region             String        Specify the AWS region in which to initialize a session. 
                                      Default from ~/.aws/config on Linux or macOS, 
                                      or C:\Users\%USERNAME%\.aws\config on Windows.
    %idle_timeout       Int           The number of minutes of inactivity after which a session will timeout. 
                                      Default: 2880 minutes (48 hours).
    %session_id_prefix  String        Define a String that will precede all session IDs in the format 
                                      [session_id_prefix]-[session_id]. If a session ID is not provided,
                                      a random UUID will be generated.
    %status                           Returns the status of the current Glue session including its duration, 
                                      configuration and executing user / role.
    %session_id                       Returns the session ID for the running session. 
    %list_sessions                    Lists all currently running sessions by ID.
    %stop_session                     Stops the current session.
    %glue_version       String        The version of Glue to be used by this session. 
                                      Currently, the only valid options are 2.0 and 3.0. 
                                      Default: 2.0.
----

## Selecting Job Types

----
    %streaming          String        Sets the session type to Glue Streaming.
    %etl                String        Sets the session type to Glue ETL.
    %glue_ray           String        Sets the session type to Glue Ray.
----

## Glue Config Magic 
*(common across all job types)*

----

    %%configure         Dictionary    A json-formatted dictionary consisting of all configuration parameters for 
                                      a session. Each parameter can be specified here or through individual magics.
    %iam_role           String        Specify an IAM role ARN to execute your session with.
                                      Default from ~/.aws/config on Linux or macOS, 
                                      or C:\Users\%USERNAME%\.aws\config on Windows.
    %number_of_workers  int           The number of workers of a defined worker_type that are allocated 
                                      when a session runs.
                                      Default: 5.
    %additional_python_modules  List  Comma separated list of additional Python modules to include in your cluster 
                                      (can be from Pypi or S3).
----

                                      
## Magic for Spark Jobs (ETL & Streaming)

----
    %worker_type        String        Set the type of instances the session will use as workers. 
                                      ETL and Streaming support G.1X, G.2X, G.4X and G.8X. 
                                      Default: G.1X.
    %connections        List          Specify a comma separated list of connections to use in the session.
    %extra_py_files     List          Comma separated list of additional Python files From S3.
    %extra_jars         List          Comma separated list of additional Jars to include in the cluster.
    %spark_conf         String        Specify custom spark configurations for your session. 
                                      E.g. %spark_conf spark.serializer=org.apache.spark.serializer.KryoSerializer
----
                                      
## Magic for Ray Job

----
    %min_workers        Int           The minimum number of workers that are allocated to a Ray job. 
                                      Default: 1.
    %object_memory_head Int           The percentage of free memory on the instance head node after a warm start. 
                                      Minimum: 0. Maximum: 100.
    %object_memory_worker Int         The percentage of free memory on the instance worker nodes after a warm start. 
                                      Minimum: 0. Maximum: 100.
----

## Action Magic

----

    %%sql               String        Run SQL code. All lines after the initial %%sql magic will be passed
                                      as part of the SQL code.  
----



####  Run this cell to set up and start your interactive session.


In [1]:
%idle_timeout 30
%glue_version 3.0
%worker_type G.1X
%number_of_workers 2
%additional_python_modules polygon-api-client, nltk, transformers, beautifulsoup4, termcolor

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import pandas as pd
import matplotlib.pyplot as plt
import time

from polygon import RESTClient
from polygon.rest.models import *

from termcolor import colored as cl
import requests
  
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

Current idle_timeout is 30 minutes.
idle_timeout has been set to 30 minutes.
Setting Glue version to: 3.0
Previous worker type: G.1X
Setting new worker type to: G.1X
Previous number of workers: 2
Setting new number of workers to: 2
Additional python modules to be included:
polygon-api-client
nltk
transformers
beautifulsoup4
termcolor
Trying to create a Glue session for the kernel.
Worker Type: G.1X
Number of Workers: 2
Session ID: e0725d4a-6d6f-4f7c-998f-ee0707db75d7
Job Type: glueetl
Applying the following default arguments:
--glue_kernel_version 0.37.3
--enable-glue-datacatalog true
--additional-python-modules polygon-api-client,nltk,transformers,beautifulsoup4,termcolor
Waiting for session e0725d4a-6d6f-4f7c-998f-ee0707db75d7 to get into ready status...
Session e0725d4a-6d6f-4f7c-998f-ee0707db75d7 has been created.



#### Example: Create a DynamicFrame from a table in the AWS Glue Data Catalog and display its schema


In [2]:
dyf = glueContext.create_dynamic_frame.from_catalog(database='project', table_name='kaggle')
dyf.printSchema()

root
|-- index_key: long
|-- title: string
|-- date: string
|-- ticker: string


#### Example: Convert the DynamicFrame to a Spark DataFrame and display a sample of the data


In [3]:
df = dyf.toDF()
df.show()

+---------+--------------------+--------------------+------+
|index_key|               title|                date|ticker|
+---------+--------------------+--------------------+------+
|        0|Stocks That Hit 5...|2020-06-05 10:30:...|     A|
|        1|Stocks That Hit 5...|2020-06-03 10:45:...|     A|
|        2|71 Biggest Movers...|2020-05-26 04:30:...|     A|
|        3|46 Stocks Moving ...|2020-05-22 12:45:...|     A|
|        4|B of A Securities...|2020-05-22 11:38:...|     A|
|        5|CFRA Maintains Ho...|2020-05-22 11:23:...|     A|
|        6|UBS Maintains Neu...|2020-05-22 09:36:...|     A|
|        7|Agilent Technolog...|2020-05-22 09:07:...|     A|
|        8|Wells Fargo Maint...|2020-05-22 08:37:...|     A|
|        9|10 Biggest Price ...|2020-05-22 08:06:...|     A|
|       10|30 Stocks Moving ...|2020-05-22 07:18:...|     A|
|       11|SVB Leerink Maint...|2020-05-22 05:14:...|     A|
|       12|8 Stocks Moving I...|2020-05-21 16:53:...|     A|
|       13|Agilent Techn

#### Example: Write the data in the DynamicFrame to a location in Amazon S3 and a table for it in the AWS Glue Data Catalog


In [2]:
import pandas as pd
from awsglue.dynamicframe import DynamicFrame

def write_aws_table(
    pandas_df, 
    s3_path, 
    glue_context, 
    spark_session, 
    db_name, 
    table_name
):
    # Configure an S3 sink that also creates or updates the Data Catalog table.
    s3output = glue_context.getSink(
      path=s3_path,
      connection_type="s3",
      updateBehavior="UPDATE_IN_DATABASE",
      partitionKeys=[],
      compression="snappy",
      enableUpdateCatalog=True,
      transformation_ctx="s3output",
    )
    s3output.setCatalogInfo(
      catalogDatabase=db_name, catalogTableName=table_name
    )
    s3output.setFormat("glueparquet")
    # Use the session passed in as an argument rather than the notebook-level `spark`.
    data_spark_df = spark_session.createDataFrame(pandas_df)
    data_dyf = DynamicFrame.fromDF(data_spark_df, glue_context, f"{table_name}_df")
    s3output.writeFrame(data_dyf)




In [13]:
# !pip install nltk
# !pip install bs4

In [14]:
# This is the library we are going to use: !pip install polygon-api-client
# This is the documentation: https://polygon.readthedocs.io/en/latest/Stocks.html
# !pip install polygon-api-client

In [3]:
# Module-level counter of API calls; get_news_as_pandas_df uses it for rate limiting.
requests_count = 0

def get_ticker_news(ticker, rest_client, published_dt = '2020-01-01', limit = 1000):
  # Call the client method that fetches the news for one ticker.
  # published_utc_gt returns news from the given date up to the present (up to two years of data).
  news = []
  global requests_count

  # Use the client passed in as an argument rather than a notebook-level `client`.
  paginate_iterator = rest_client.list_ticker_news(
      ticker,
      published_utc_gt= published_dt,
      order="desc",
      limit=limit
      )

  requests_count += 1
  print(requests_count)

  for n in paginate_iterator:
      news.append(n)

  return news
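
The function above relies on a module-level `requests_count`. One alternative, sketched here with a stand-in client so it runs without an API key, is to keep the counter on a small wrapper object. `FakeNewsClient` and `CountingNewsFetcher` are hypothetical names, not part of polygon-api-client; only the `list_ticker_news` call shape is taken from the cell above.

```python
class FakeNewsClient:
    """Hypothetical stand-in that mimics the call shape of list_ticker_news."""

    def list_ticker_news(self, ticker, published_utc_gt=None, order="desc", limit=1000):
        # Yield a few fake headlines lazily, the way a paginated iterator would.
        for i in range(3):
            yield f"{ticker} headline {i}"


class CountingNewsFetcher:
    """Wraps a client and counts how many listing calls were started."""

    def __init__(self, rest_client):
        self.rest_client = rest_client
        self.requests_count = 0

    def get_ticker_news(self, ticker, published_dt="2020-01-01", limit=1000):
        iterator = self.rest_client.list_ticker_news(
            ticker, published_utc_gt=published_dt, order="desc", limit=limit
        )
        self.requests_count += 1
        # Drain the iterator so the caller receives a plain list.
        return list(iterator)


fetcher = CountingNewsFetcher(FakeNewsClient())
wfc = fetcher.get_ticker_news("WFC")
tgt = fetcher.get_ticker_news("TGT")
print(fetcher.requests_count, len(wfc), len(tgt))
```

The counter here tracks calls to `get_ticker_news`; with a real paginated client, one call may correspond to more than one HTTP request.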




In [4]:
# Extract the fields we need from the JSON response
def parse_news(news, ticker):
  parsed_data = []
  for index, item in enumerate(news):
    # keep only TickerNews items
    if isinstance(item, TickerNews):

        #print(item.tickers,item.published_utc, item.title)

        parsed_data.append([ticker, item.published_utc, item.title, item.description])

        # This break was added only for testing; keep it commented out to load all the data
        # if index == 20:
        #     break

  return parsed_data
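
To show the row shape that `parse_news` produces without calling the API, here is a minimal sketch using a stand-in class. `FakeTickerNews` and `parse_news_rows` are hypothetical; the stand-in exposes only the three attributes the cell above reads.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeTickerNews:
    """Hypothetical stand-in exposing only the attributes parse_news reads."""
    published_utc: str
    title: str
    description: Optional[str] = None


def parse_news_rows(news, ticker, item_type=FakeTickerNews):
    # Keep only items of the expected type and flatten each one into a row.
    return [
        [ticker, item.published_utc, item.title, item.description]
        for item in news
        if isinstance(item, item_type)
    ]


items = [
    FakeTickerNews("2021-12-30T12:00:00Z", "Example headline", "Example body"),
    "not a news item",  # skipped by the isinstance check
]
rows = parse_news_rows(items, "WFC")
print(rows)
```

Each row has four fields, so a DataFrame built from these rows needs four column names.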




In [5]:
def get_news_as_pandas_df(
    tickers,
    rest_client,
    news_publish_dt = '2020-01-01',
    news_limit = 1000,
    df_columns= None  # None lets pandas assign default column labels
    ):
  total_parsed_news = []
  global requests_count

  for ticker in tickers:
    news = get_ticker_news(ticker, rest_client, news_publish_dt, news_limit)
    parsed_ticker_news = parse_news(news, ticker)
    total_parsed_news.extend(parsed_ticker_news)

    if (requests_count) % 5 == 0:
      print("Starting 1m waiting time for API usage")
      time.sleep(60)
      print("Wait is over")


  parsed_news_df = pd.DataFrame(
      total_parsed_news,
      columns= df_columns
      )

  return parsed_news_df
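
The pause logic above (sleep 60 seconds after every fifth request) can be isolated into a helper whose sleep function is injectable, which makes it testable without waiting. This is a sketch; `throttle_every` is a hypothetical name, and the values 5 and 60 are the ones used in the cell above.

```python
import time


def throttle_every(count, every=5, pause_seconds=60, sleep=time.sleep):
    """Pause after every `every`-th request; return True when a pause happened."""
    if count > 0 and count % every == 0:
        sleep(pause_seconds)
        return True
    return False


# Record the requested pauses instead of really sleeping, so this runs instantly.
pauses = []
paused_at = [n for n in range(1, 12) if throttle_every(n, sleep=pauses.append)]
print(paused_at, pauses)
```

In the notebook, the call would use the default `sleep=time.sleep`; the recording list is only for the demonstration.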






In [6]:
# tickers = ['WFC','TGT']
# df_columns = ['ticker', 'date_time', 'title', 'description']
# # client = RESTClient(api_key="r8cKrIhAXeavjRGWCWurkGxfnigdrs7q")
# client = RESTClient(api_key="lmYiqogaVoO0OJvfgKNpQEq4cQ_8DPFL")

# df = get_news_as_pandas_df(tickers= tickers, rest_client= client, df_columns= df_columns)
# df["date"] = pd.to_datetime(df["date"], errors='coerce', format='%Y-%m-%d %H:%M:%S', utc= True)
# df["date"] = df["date"].dt.strftime('%Y/%m/%d')

# df

df = pd.read_csv("s3://project-2023-datalake/raw/polygon/tgt_wfc_news.csv")
df

      Unnamed: 0  ... ticker
0              0  ...    WFC
1              1  ...    WFC
2              2  ...    WFC
3              3  ...    WFC
4              4  ...    WFC
...          ...  ...    ...
4244        4244  ...    TGT
4245        4245  ...    TGT
4246        4246  ...    TGT
4247        4247  ...    TGT
4248        4248  ...    TGT

[4249 rows x 4 columns]


In [7]:
# df.to_parquet("s3://project-2023-datalake/raw/polygon/news.parquet")

write_aws_table(
    df, 
    "s3://project-2023-datalake/raw/polygon/table/", 
    glueContext, 
    spark, 
    "project", 
    "raw_polygon_news"
)




# News from the EODHD API

In [8]:
EOD_API_KEY = "647b86e70328e2.51428274"
EOD_API_KEY_TRAVEL_EMAIL = "647b9b27e4be31.47022507"




In [9]:
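def get_customized_news(
    stock,
    start_date,
    end_date,
    api_key,
    n_news=1000,
    offset = 0
    ):
    # Query the EODHD news endpoint for one ticker within a date range.
    url = f'https://eodhistoricaldata.com/api/news?api_token={api_key}&s={stock}&limit={n_news}&offset={offset}&from={start_date}&to={end_date}'
    news_json = requests.get(url).json()

    news = []
    #print(news_json)
    # NOTE: news_json[-i] reads index 0 first (when i == 0) and then walks
    # backwards from the end of the list, so the first item of the response
    # stays first and the remaining items come out in reverse order.
    for i in range(len(news_json)):
        title = news_json[-i]['title']
        date = news_json[-i]['date']
        news.append([title, date])
        print(cl('{}. '.format(i+1), attrs = ['bold']), '{}'.format(title))

    return news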
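
The request URL above is assembled with an f-string. A hedged alternative is to let `urllib.parse.urlencode` build the query string so that every value is escaped. `build_news_url` is a hypothetical helper and `"demo-key"` is a placeholder token; the endpoint and parameter names are the ones used in the cell above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "https://eodhistoricaldata.com/api/news"  # endpoint used in the cell above


def build_news_url(stock, start_date, end_date, api_key, n_news=1000, offset=0):
    # urlencode escapes each value, so characters such as "&" or spaces
    # inside a value cannot break the query string.
    query = urlencode({
        "api_token": api_key,
        "s": stock,
        "limit": n_news,
        "offset": offset,
        "from": start_date,
        "to": end_date,
    })
    return f"{BASE_URL}?{query}"


url = build_news_url("WFC", "2020-01-01", "2021-12-31", "demo-key")
params = parse_qs(urlsplit(url).query)
print(params["s"], params["from"], params["to"])
```

With `requests`, passing the same dictionary through the `params` argument of `requests.get` achieves the same escaping.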




In [10]:
def set_dataframe_from_eod_api(news, ticker):
  news_df = pd.DataFrame(news, columns=["title", "date"])
  news_df["ticker"] = ticker
  news_df["date"] = pd.to_datetime(news_df["date"], errors='coerce', format='%Y-%m-%d %H:%M:%S', utc= True)
  news_df["date"] = news_df["date"].dt.strftime('%Y/%m/%d')
  return news_df
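
To make the effect of `errors='coerce'` concrete outside a Glue session, here is a plain-Python sketch of the same normalisation. It is not how pandas implements it: `normalize_date` is a hypothetical helper, and it returns `None` where pandas would produce `NaT`.

```python
from datetime import datetime, timezone

INPUT_FORMAT = "%Y-%m-%d %H:%M:%S"  # format string from the cell above
OUTPUT_FORMAT = "%Y/%m/%d"


def normalize_date(raw):
    """Return the date as YYYY/MM/DD, or None when the text does not parse."""
    try:
        parsed = datetime.strptime(raw, INPUT_FORMAT).replace(tzinfo=timezone.utc)
    except (TypeError, ValueError):
        # Mirrors the "coerce" idea: bad input becomes a missing value, not an error.
        return None
    return parsed.strftime(OUTPUT_FORMAT)


samples = ["2021-12-30 08:15:00", "not a date", None]
normalized = [normalize_date(s) for s in samples]
print(normalized)
```

Rows whose date fails to parse end up with a missing value, so it may be worth counting them before writing the table.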




## Wells Fargo News 2021

In [11]:
wfc_news = get_customized_news('WFC', '2020-01-01', '2021-12-31', EOD_API_KEY, 1000, 0)

wfc_news_df = pd.DataFrame(wfc_news, columns=["title", "date"])
wfc_news_df["ticker"] = "WFC"
wfc_news_df

1.  As LIBOR fades away, alternative rates get a closer look
2.  RBC Capital Markets Joins DirectBooks
3.  Wells Fargo &amp; Company Announces Partial Redemption of its Series N Preferred Stock and Related Depositary Shares
4.  Wells Fargo &amp; Company Announces Full Redemption of its Series I Preferred Stock and the 5.80% Fixed-to-Floating Rate Normal Wachovia Income Trust Securities of Wachovia Capital Trust III
5.  Wells Fargo &amp; Company Announces Full Redemptions of its Series P and Series W Preferred Stock and Related Depositary Shares
6.  Top 4th-Quarter Trades of the Smead Value Fund
7.  Wells Fargo Bank, N.A. -- Moody's affirms Wells Fargo Bank, N.A.'s SQ assessments
8.  Sharecare Lands SPAC Deal, Launches Digital Vaccine Assistant For Partners: What Investors Should Know
9.  Yahoo Finance to stream Daily Journal annual meeting featuring Charlie Munger
10.  Yahoo Finance to stream Daily Journal annual meeting featuring Charlie Munger
11.  JPMorgan (JPM) Breaks Out to All-Ti

In [12]:
wfc_news_df["date"] = pd.to_datetime(wfc_news_df["date"], errors='coerce', format='%Y-%m-%d %H:%M:%S', utc= True)
wfc_news_df["date"] = wfc_news_df["date"].dt.strftime('%Y/%m/%d')
wfc_news_df

                                                 title        date ticker
0    As LIBOR fades away, alternative rates get a c...  2021/12/31    WFC
1                RBC Capital Markets Joins DirectBooks  2021/02/10    WFC
2    Wells Fargo &amp; Company Announces Partial Re...  2021/02/10    WFC
3    Wells Fargo &amp; Company Announces Full Redem...  2021/02/10    WFC
4    Wells Fargo &amp; Company Announces Full Redem...  2021/02/10    WFC
..                                                 ...         ...    ...
995  Wells Fargo (WFC) Gains But Lags Market: What ...  2021/12/29    WFC
996   Will Wells Fargo's Asset Cap Be Removed in 2022?  2021/12/30    WFC
997  This bank catapulted its Colorado presence in ...  2021/12/30    WFC
998  Libor era nears end, firms move to SOFR for fi...  2021/12/30    WFC
999  Mortgage rates: The Fed ‘will push rates highe...  2021/12/30    WFC

[1000 rows x 3 columns]
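
The row order in this table (a 2021/12/31 item at row 0, then 2021/02/10 items, ending with 2021/12/30 items) follows from the `news_json[-i]` indexing inside `get_customized_news`: `-0` equals `0`, so the first element is read first and the remaining elements are read from the end backwards. A small sketch with a hypothetical four-item list, assuming the API returns items newest first:

```python
api_order = ["newest", "second", "third", "oldest"]  # hypothetical response order

# What the loop in get_customized_news does: index 0 first, then -1, -2, ...
loop_order = [api_order[-i] for i in range(len(api_order))]

# A plain reversal, if oldest-first order is what is wanted.
oldest_first = list(reversed(api_order))

print(loop_order)
print(oldest_first)
```

Since the date column is normalised afterwards, sorting the DataFrame by `date` is another way to get a predictable order.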


In [13]:
# wfc_news_df.to_parquet("s3://project-2023-datalake/raw/eodhd/wfc_2021.parquet")

write_aws_table(
    wfc_news_df, 
    "s3://project-2023-datalake/raw/eodhd/wfc_2021/", 
    glueContext, 
    spark, 
    "project", 
    "raw_wfc_2021_news"
)




## Target News 2021

In [14]:
tgt_news_2021 = get_customized_news('TGT', '2021-01-01', '2021-12-31', EOD_API_KEY, 1000, 0)
tgt_news_2021_df = set_dataframe_from_eod_api(tgt_news_2021, 'TGT')
tgt_news_2021_df["ticker"] = "TGT"
tgt_news_2021_df

1.  Kroger (KR) Boosts Shareholder Returns With $1B Buyback Plan
2.  Target, Canada Goose, Boyd Gaming, MGM Resorts and Penn National Gaming highlighted as Zacks Bull and Bear of the Day
3.  Target, Canada Goose, Boyd Gaming, MGM Resorts and Penn National Gaming highlighted as Zacks Bull and Bear of the Day
4.  Is Target Unstoppable After Another Big Earnings Beat?
5.  Online Grocery Sales Jump in April: 4 Solid Stocks to Buy
6.  Online Grocery Sales Jump in April: 4 Solid Stocks to Buy
7.  What Target Is Doing Right
8.  Looking for Earnings Beat? Play These 5 Stocks
9.  The Zacks Analyst Blog Highlights: Target, Walmart, J &amp; J Snack Foods and Hain Celestial
10.  The Zacks Analyst Blog Highlights: Target, Walmart, J &amp; J Snack Foods and Hain Celestial
11.  Top Ranked Growth Stocks to Buy for May 27th
12.  Top Ranked Growth Stocks to Buy for May 27th
13.  Dick's Sporting Goods gets serious about golf boom
14.  Dick's Sporting Goods gets serious about golf boom
15.  This dollar st

In [15]:
# tgt_news_2021_df.to_parquet("s3://project-2023-datalake/raw/eodhd/tgt_2021.parquet")

write_aws_table(
    tgt_news_2021_df, 
    "s3://project-2023-datalake/raw/eodhd/tgt_2021/", 
    glueContext, 
    spark, 
    "project", 
    "raw_tgt_2021_news"
)




## Target News 2020

In [16]:
tgt_news_2020 = get_customized_news('TGT', '2019-12-31', '2021-01-01', EOD_API_KEY_TRAVEL_EMAIL, 1000, 0)
tgt_news_2020_df = set_dataframe_from_eod_api(tgt_news_2020, 'TGT')
tgt_news_2020_df

1.  J.C. Penney is toast in 2021
2.  'Tech companies will do well' amid coronavirus pandemic: expert
3.  Don’t Blame Amazon for Cleaning Up in Lockdown
4.  FOCUS-Ocean shipping shrinks as pandemic pummels retailers
5.  This Trio of Companies Just Raised Dividends
6.  Benzinga's Top Upgrades, Downgrades For August 14, 2020
7.  Target and Bridgestone highlighted as Zacks Bull and Bear of the Day
8.  Target and Bridgestone highlighted as Zacks Bull and Bear of the Day
9.  UPS to hire 100,000 seasonal workers for extended holiday shopping rush
10.  Bloomin' Brands, Mesa Laboratories, Intuit, Target and Microsoft highlighted as Zacks Bull and Bear of the Day
11.  Bloomin' Brands, Mesa Laboratories, Intuit, Target and Microsoft highlighted as Zacks Bull and Bear of the Day
12.  Tata Group Courts Investors for New Digital Platform
13.  Tata Group Courts Investors for New Digital Platform
14.  Dow Jones Falls 200 Points, As Apple, Tesla Stumble; Beyond Meat Soars 12% On Expanded Walmart Partne

In [17]:
# tgt_news_2020_df.to_parquet("s3://project-2023-datalake/raw/eodhd/tgt_2020.parquet")

write_aws_table(
    tgt_news_2020_df, 
    "s3://project-2023-datalake/raw/eodhd/tgt_2020/", 
    glueContext, 
    spark, 
    "project", 
    "raw_tgt_2020_news"
)




## Wells Fargo News 2020

In [18]:
wfc_news_2020 = get_customized_news('WFC', '2019-12-31', '2021-01-01', EOD_API_KEY_TRAVEL_EMAIL, 1000, 0)
wfc_news_2020_df = set_dataframe_from_eod_api(wfc_news_2020, 'WFC')
wfc_news_2020_df

1.  Market Recap: Wednesday, December 23
2.  Wells Fargo Foundation and NFWF Announce Release of the Resilient Communities Program 2020 Request for Proposals
3.  No Barriers Summit Announces Location for 2020
4.  Wells Fargo Selects Workday to Help Transform HR
5.  No Barriers, Wells Fargo Select 2019 Global Impact Challenge Winners
6.  Stocks - Europe to Edge Higher Amid Virus Peaking Hopes
7.  Top 5 Buys of the Yacktman Focused Fund
8.  Investor Alert: Kaplan Fox Announces Investigation of  Wells Fargo & Company
9.  SHAREHOLDER ALERT: Lowey Dannenberg Investigates Claims on Behalf of Investors of Wells Fargo & Company (WFC)
10.  No Barriers Summit 2020 Goes Virtual This June
11.  WELLS FARGO INVESTOR ALERT: Faruqi & Faruqi, LLP Encourages Investors Who Suffered Losses Exceeding $100,000 Investing In Wells Fargo & Company To Contact The Firm
12.  Pomerantz Law Firm Announces the Filing of a Class Action against Wells Fargo & Company and Certain Officers – WFC
13.  Portnoy Law: Lawsuit

In [19]:
# wfc_news_2020_df.to_parquet("s3://project-2023-datalake/raw/eodhd/wfc_2020.parquet")

write_aws_table(
    wfc_news_2020_df, 
    "s3://project-2023-datalake/raw/eodhd/wfc_2020/", 
    glueContext, 
    spark, 
    "project", 
    "raw_wfc_2020_news"
)




In [None]:
# from platform import python_version

# print(python_version())