# Kinesis Data Stream
* https://github.com/aws-samples/aws-ml-data-lake-workshop
* https://aws.amazon.com/blogs/big-data/snakes-in-the-stream-feeding-and-eating-amazon-kinesis-streams-with-python/

![Kinesis Data Stream](img/kinesis_data_stream_docs.png)

In [1]:
import boto3
import sagemaker
import pandas as pd

sess   = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

sm = boto3.Session().client(service_name='sagemaker', region_name=region)
kinesis = boto3.Session().client(service_name='kinesis', region_name=region)

In [2]:
%store -r stream_name

In [3]:
# TODO:  Adapt to any number of shards

# Call list_shards once and reuse the result for both shard IDs
shards = kinesis.list_shards(StreamName=stream_name)['Shards']

shard_id_1 = shards[0]['ShardId']
print(shard_id_1)

shard_id_2 = shards[1]['ShardId']
print(shard_id_2)

shardId-000000000000
shardId-000000000001
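
The TODO above could be addressed along the following lines. This is a minimal sketch rather than the notebook's own code: `list_all_shard_ids` is a hypothetical helper name, and it assumes the client passed in behaves like the boto3 Kinesis client, whose `list_shards` response carries a `Shards` list and, when more results remain, a `NextToken`.

```python
def list_all_shard_ids(kinesis_client, stream_name):
    """Return every shard ID in the stream, following NextToken pagination."""
    shard_ids = []
    kwargs = {'StreamName': stream_name}
    while True:
        response = kinesis_client.list_shards(**kwargs)
        shard_ids.extend(shard['ShardId'] for shard in response['Shards'])
        next_token = response.get('NextToken')
        if not next_token:
            return shard_ids
        # Follow-up calls pass only NextToken; StreamName is not allowed with it
        kwargs = {'NextToken': next_token}
```

In the notebook this would be called as `shard_ids = list_all_shard_ids(kinesis, stream_name)`, and later cells could loop over `shard_ids` instead of using `shard_id_1` and `shard_id_2`.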


# Download Dataset

In [4]:
!aws s3 cp 's3://amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Software_v1_00.tsv.gz' ./data/

download: s3://amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Software_v1_00.tsv.gz to data/amazon_reviews_us_Digital_Software_v1_00.tsv.gz


In [5]:
import csv
import pandas as pd

df = pd.read_csv('./data/amazon_reviews_us_Digital_Software_v1_00.tsv.gz', 
                 delimiter='\t', 
                 quoting=csv.QUOTE_NONE,
                 compression='gzip')
df.shape

(102084, 15)

In [6]:
df.head(5)

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
0,US,17747349,R2EI7QLPK4LF7U,B00U7LCE6A,106182406,CCleaner Free [Download],Digital_Software,4,0,0,N,Y,Four Stars,So far so good,2015-08-31
1,US,10956619,R1W5OMFK1Q3I3O,B00HRJMOM4,162269768,ResumeMaker Professional Deluxe 18,Digital_Software,3,0,0,N,Y,Three Stars,Needs a little more work.....,2015-08-31
2,US,13132245,RPZWSYWRP92GI,B00P31G9PQ,831433899,Amazon Drive Desktop [PC],Digital_Software,1,1,2,N,Y,One Star,Please cancel.,2015-08-31
3,US,35717248,R2WQWM04XHD9US,B00FGDEPDY,991059534,Norton Internet Security 1 User 3 Licenses,Digital_Software,5,0,0,N,Y,Works as Expected!,Works as Expected!,2015-08-31
4,US,17710652,R1WSPK2RA2PDEF,B00FZ0FK0U,574904556,SecureAnywhere Intermet Security Complete 5 De...,Digital_Software,4,1,2,N,Y,Great antivirus. Worthless customer support,I've had Webroot for a few years. It expired a...,2015-08-31


In [7]:
df_star_rating_and_review_body = df[['star_rating', 'review_body']][:100]
df_star_rating_and_review_body.shape

(100, 2)

In [8]:
df_star_rating_and_review_body.head()

Unnamed: 0,star_rating,review_body
0,4,So far so good
1,3,Needs a little more work.....
2,1,Please cancel.
3,5,Works as Expected!
4,4,I've had Webroot for a few years. It expired a...


In [9]:
reviews_tsv = df_star_rating_and_review_body.to_csv(sep='\t',
                                                    header=None,
                                                    index=False)

In [10]:
reviews_tsv



# Simulate Application Writing Records to the Stream

In [11]:
data_stream_response = kinesis.describe_stream(
    StreamName=stream_name
)

print(data_stream_response)

{'StreamDescription': {'StreamName': 'dsoaws-data-stream', 'StreamARN': 'arn:aws:kinesis:us-west-2:250107111215:stream/dsoaws-data-stream', 'StreamStatus': 'ACTIVE', 'Shards': [{'ShardId': 'shardId-000000000000', 'HashKeyRange': {'StartingHashKey': '0', 'EndingHashKey': '170141183460469231731687303715884105727'}, 'SequenceNumberRange': {'StartingSequenceNumber': '49610091589228913449896360939155411114010545354140811266'}}, {'ShardId': 'shardId-000000000001', 'HashKeyRange': {'StartingHashKey': '170141183460469231731687303715884105728', 'EndingHashKey': '340282366920938463463374607431768211455'}, 'SequenceNumberRange': {'StartingSequenceNumber': '49610091589251214195094891562296946832283193715646791698'}}], 'HasMoreShards': False, 'RetentionPeriodHours': 24, 'StreamCreationTimestamp': datetime.datetime(2020, 8, 22, 22, 54, 9, tzinfo=tzlocal()), 'EnhancedMonitoring': [{'ShardLevelMetrics': []}], 'EncryptionType': 'NONE'}, 'ResponseMetadata': {'RequestId': 'c95ef119-4ca4-6b56-9ee5-55a300f

In [12]:
partition_key = 'CAFEPERSON'
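
The partition key decides which shard receives a record: Kinesis takes the MD5 hash of the key as a 128-bit integer and routes the record to the shard whose `HashKeyRange` contains it. The sketch below reproduces that lookup locally using the two hash key ranges from the `describe_stream` output above. The helper names `hash_key_for` and `shard_for_partition_key` are hypothetical, and the sketch assumes no `ExplicitHashKey` is supplied and that the stream has not been resharded.

```python
import hashlib

def hash_key_for(partition_key):
    """Return the 128-bit integer that the MD5 hash of the partition key represents."""
    return int(hashlib.md5(partition_key.encode('utf-8')).hexdigest(), 16)

def shard_for_partition_key(partition_key, shards):
    """Find the shard whose HashKeyRange contains the partition key's hash."""
    hash_key = hash_key_for(partition_key)
    for shard in shards:
        key_range = shard['HashKeyRange']
        start = int(key_range['StartingHashKey'])
        end = int(key_range['EndingHashKey'])
        if start <= hash_key <= end:
            return shard['ShardId']
    return None

# Hash key ranges copied from the describe_stream output above
shards = [
    {'ShardId': 'shardId-000000000000',
     'HashKeyRange': {'StartingHashKey': '0',
                      'EndingHashKey': '170141183460469231731687303715884105727'}},
    {'ShardId': 'shardId-000000000001',
     'HashKeyRange': {'StartingHashKey': '170141183460469231731687303715884105728',
                      'EndingHashKey': '340282366920938463463374607431768211455'}},
]

print(shard_for_partition_key('CAFEPERSON', shards))
```

Because every record in this notebook uses the same partition key, all of the data lands on a single shard, which is why only one of the two shards returns records later on.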

In [13]:
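# Separate Kinesis client for the simulated producer (same configuration as `kinesis` above)
data_stream = boto3.Session().client(service_name='kinesis', region_name=region)

# Send all 100 reviews as a single record; the partition key determines the target shard
response = data_stream.put_records(
    Records=[
        {
            'Data': reviews_tsv.encode('utf-8'),
            'PartitionKey': partition_key
        },
    ],
    StreamName=stream_name
)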
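
The cell above packs all 100 reviews into one record. A producer closer to a real application would probably send one record per review. The following is a hedged sketch of that variation, not part of the original notebook: `build_record_batches` and `put_batches` are hypothetical helpers, the batch size of 500 reflects the per-request record limit of `put_records`, and the code assumes the TSV uses `'\n'` line endings with no newlines embedded inside a review body.

```python
def build_record_batches(tsv_text, partition_key, batch_size=500):
    """Turn TSV text into one Kinesis record per line, grouped into batches."""
    records = [
        {'Data': line.encode('utf-8'), 'PartitionKey': partition_key}
        for line in tsv_text.split('\n')
        if line  # skip the empty string left by the trailing newline
    ]
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]

def put_batches(kinesis_client, stream_name, batches):
    """Send each batch and return how many records Kinesis reported as failed."""
    failed = 0
    for batch in batches:
        response = kinesis_client.put_records(Records=batch, StreamName=stream_name)
        failed += response['FailedRecordCount']
    return failed
```

In the notebook this could be used as `put_batches(data_stream, stream_name, build_record_batches(reviews_tsv, partition_key))`. A non-zero return value means some records were rejected and would need to be retried.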

# Store Variables for the Next Notebooks

In [14]:
%store partition_key

Stored 'partition_key' (str)


# Get Records

In [15]:
# TODO:  Adapt to any number of shards

shard_id_1 = 'shardId-000000000000'
shard_id_2 = 'shardId-000000000001'

In [16]:
# TODO:  Adapt to any number of shards

shard_iter_1 = data_stream.get_shard_iterator(StreamName=stream_name, 
                                            ShardId=shard_id_1, 
                                            ShardIteratorType='TRIM_HORIZON')['ShardIterator']

shard_iter_2 = data_stream.get_shard_iterator(StreamName=stream_name, 
                                            ShardId=shard_id_2, 
                                            ShardIteratorType='TRIM_HORIZON')['ShardIterator']

In [17]:
records_response_1 = data_stream.get_records(
    ShardIterator=shard_iter_1,
    Limit=100
)

if records_response_1['Records']:
    print(records_response_1['Records'][0]['Data'].decode('utf-8'))

In [18]:
records_response_2 = data_stream.get_records(
    ShardIterator=shard_iter_2,
    Limit=100
)

if records_response_2['Records']:
    print(records_response_2['Records'][0]['Data'].decode('utf-8'))

4	So far so good
3	Needs a little more work.....
1	Please cancel.
5	Works as Expected!
4	I've had Webroot for a few years. It expired and I decided to purchase a renewal on Amazon. I went through hell trying to uninstall the expired version in order to install the new.  I called Webroot and had their representative remote into my computer at his request. He was clueless as a bad joke and consumed 29 minutes and 57 seconds of my time forever.  He initially told me it wasn't compatible with Windows 10, but I finally managed to convince him that it is indeed compatible with Windows 10 as it was working on my computer before it expired and also I showed him a review on Amazon to convince him that it works on Windows 10. Finally, he offered to connect me with a senior consultant for over 100 dollars. I declined and told him I'd fix the issue myself. This guy was less helpful than a severed limb.  After spending some time on Google, the issue is now fixed. Webroot should just get rid of thei
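
Each `get_records` call above reads at most one page from a shard. To drain a shard, a consumer keeps calling `get_records` with the `NextShardIterator` from the previous response. The sketch below shows one way to do that. `read_shard` is a hypothetical helper, the `max_calls` cap is an assumption added so the loop always terminates, and the code assumes a client with the boto3 Kinesis `get_shard_iterator` and `get_records` methods.

```python
def read_shard(kinesis_client, stream_name, shard_id, limit=100, max_calls=10):
    """Collect records from one shard, starting at the oldest available record."""
    iterator = kinesis_client.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType='TRIM_HORIZON'
    )['ShardIterator']

    records = []
    for _ in range(max_calls):
        response = kinesis_client.get_records(ShardIterator=iterator, Limit=limit)
        records.extend(response['Records'])
        iterator = response.get('NextShardIterator')
        # Stop when the shard is closed (no next iterator) or we have caught up
        if not iterator or response.get('MillisBehindLatest') == 0:
            break
    return records
```

Combined with a list of shard IDs, `for shard_id in shard_ids: read_shard(data_stream, stream_name, shard_id)` would cover any number of shards without hard-coding `shard_id_1` and `shard_id_2`.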

In [19]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="_blank" href="https://console.aws.amazon.com/kinesis/home?region={}#/streams/details/{}/monitoring"> Stream</a></b>'.format(region, stream_name)))


In [None]:
%%javascript
Jupyter.notebook.save_checkpoint();
Jupyter.notebook.session.delete();
