# This notebook converts the AG News dataset into a format that Amazon Comprehend can use for custom classification.

## Install and import libraries

In [1]:
!pip install --upgrade s3fs pandas tqdm

Collecting s3fs
  Downloading s3fs-2022.1.0-py3-none-any.whl (25 kB)
Collecting tqdm
  Downloading tqdm-4.64.0-py2.py3-none-any.whl (78 kB)
Collecting fsspec==2022.01.0
  Downloading fsspec-2022.1.0-py3-none-any.whl (133 kB)
Collecting aiobotocore~=2.1.0
  Downloading aiobotocore-2.1.2.tar.gz (58 kB)
  Preparing metadata (setup.py) ... done
Collecting importlib-resources
  Downloading importlib_resources-5.4.0-py3-none-any.whl (28 kB)
Collecting botocore<1.23.25,>=1.23.24
  Downloading botocore-1.23.24-py3-none-any.whl (8.4 MB)
Building wheels for collected packages: aiobotocore
  Building wheel for aiobotocore (setup.py) ... done
  Created wheel for aiobotocore: filename=aiobotocore-2.1

In [2]:
import boto3
import matplotlib
import pandas as pd
import tqdm

# Region for the S3 bucket created later in the notebook
region_name = 'us-east-1'

## Get our data. Our data lives in the Amazon S3 open datasets. Often you can stream data straight from S3 without downloading it.
## In this case, since it's small and packaged as a tar file, let's download it and take a look.

### The permission messages from the untar operation can be ignored.

In [3]:
! wget https://s3.amazonaws.com/fast-ai-nlp/ag_news_csv.tgz
! tar xvzf ag_news_csv.tgz

--2022-05-25 13:11:21--  https://s3.amazonaws.com/fast-ai-nlp/ag_news_csv.tgz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.204.96
Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.204.96|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11784419 (11M) [application/x-tar]
Saving to: ‘ag_news_csv.tgz’


2022-05-25 13:11:22 (21.7 MB/s) - ‘ag_news_csv.tgz’ saved [11784419/11784419]

ag_news_csv/
ag_news_csv/train.csv
ag_news_csv/readme.txt
ag_news_csv/test.csv
ag_news_csv/classes.txt
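
The archive does not have to be unpacked to disk just to inspect it. Below is a minimal sketch, using only the standard library, that reads the first rows of one member straight from the `.tgz`. The helper name `peek_csv_in_tgz` is hypothetical; the member path matches the listing above.

```python
import csv
import io
import tarfile


def peek_csv_in_tgz(tgz_path, member, n=5):
    """Return the first n CSV rows of `member` without extracting the archive."""
    with tarfile.open(tgz_path, mode="r:gz") as tar:
        raw = tar.extractfile(member)  # file-like object over the member's bytes
        if raw is None:
            raise ValueError(f"{member} is not a regular file in the archive")
        text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
        reader = csv.reader(text)
        # zip() stops after n rows, so the rest of the file is never parsed
        return [row for _, row in zip(range(n), reader)]
```

For example, `peek_csv_in_tgz("ag_news_csv.tgz", "ag_news_csv/train.csv")` would return up to five rows, each a list of the category, title and text fields.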


#### Read the files into pandas to see what is happening

In [4]:
train=pd.read_csv("ag_news_csv/train.csv", names=['category','title','text'])

#### This is our training dataset. It has 3 columns: a label (category), a title, and the text.

In [5]:
train

Unnamed: 0,category,title,text
0,3,Wall St. Bears Claw Back Into the Black (Reuters),"Reuters - Short-sellers, Wall Street's dwindli..."
1,3,Carlyle Looks Toward Commercial Aerospace (Reu...,Reuters - Private investment firm Carlyle Grou...
2,3,Oil and Economy Cloud Stocks' Outlook (Reuters),Reuters - Soaring crude prices plus worries\ab...
3,3,Iraq Halts Oil Exports from Main Southern Pipe...,Reuters - Authorities have halted oil export\f...
4,3,"Oil prices soar to all-time record, posing new...","AFP - Tearaway world oil prices, toppling reco..."
...,...,...,...
119995,1,Pakistan's Musharraf Says Won't Quit as Army C...,KARACHI (Reuters) - Pakistani President Perve...
119996,2,Renteria signing a top-shelf deal,Red Sox general manager Theo Epstein acknowled...
119997,2,Saban not going to Dolphins yet,The Miami Dolphins will put their courtship of...
119998,2,Today's NFL games,PITTSBURGH at NY GIANTS Time: 1:30 p.m. Line: ...


#### To reduce the training time to a reasonable amount for the exercise, we'll limit the data to just 1000 rows.

In [6]:
train = train.sample(axis='index',n=1000,random_state=100)
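
`DataFrame.sample` draws rows uniformly at random, so the class counts in the subset vary a little. If exactly equal class sizes matter, one option (not what this notebook does) is to sample per category. The sketch below assumes pandas 1.1 or later; `sample_per_class` is a hypothetical helper and the toy frame stands in for the real data.

```python
import pandas as pd


def sample_per_class(df, label_col, n_per_class, seed=100):
    """Draw the same number of rows from every class (stratified sample)."""
    return df.groupby(label_col).sample(n=n_per_class, random_state=seed)


# Toy frame standing in for the AG News training data.
toy = pd.DataFrame({
    "category": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "text": [f"doc {i}" for i in range(12)],
})
subset = sample_per_class(toy, "category", n_per_class=2)
print(subset["category"].value_counts().sort_index())
```

On the real data, `sample_per_class(train, "category", 250)` would give 1000 rows with 250 per class.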

#### To make things prettier, let's change our labels from numbers to strings. The dataset provider documents the classes in the classes.txt file.

In [7]:
labeldict={'1': 'WORLD', '2' :  'SPORTS', '3' : 'BUSINESS', '4': 'SCI_TECH'}
trainstr=train.astype(str)
trainstr['label']=trainstr['category'].replace(labeldict)
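
Rather than typing the mapping by hand, it could be derived from `classes.txt`. The sketch below assumes that file lists one class name per line in label order, starting at 1 (for example `World`, `Sports`, `Business`, `Sci/Tech`). That layout is an assumption here, so check the file before relying on it. `build_label_map` is a hypothetical helper.

```python
import re


def build_label_map(class_names):
    """Map '1', '2', ... to upper-case labels with non-alphanumerics replaced by '_'."""
    label_map = {}
    for index, name in enumerate(class_names, start=1):
        cleaned = re.sub(r"[^A-Za-z0-9]+", "_", name.strip()).strip("_").upper()
        label_map[str(index)] = cleaned
    return label_map


# Assumed contents of classes.txt, one name per line.
label_map = build_label_map(["World", "Sports", "Business", "Sci/Tech"])
print(label_map)
```

With those four names the result matches the hand-written `labeldict` above.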

#### Put the title and the text in one column for our training. Normally this choice would come out of experimentation on the data, but it is generally good practice to begin by giving a text classifier all of the relevant data.
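
Note that the cells that follow select only the `text` column, so the title is not actually merged in. A minimal sketch of such a merge, assuming a frame with string columns `title` and `text`; the separator and the helper name `combine_title_and_text` are illustrative choices, not part of the original notebook.

```python
import pandas as pd


def combine_title_and_text(df, sep=". "):
    """Return a copy whose 'text' column is the title followed by the body."""
    out = df.copy()
    out["text"] = out["title"].str.strip() + sep + out["text"].str.strip()
    return out


# Toy row with made-up content, standing in for trainstr.
toy = pd.DataFrame({
    "label": ["SPORTS"],
    "title": ["Example title"],
    "text": ["Example body."],
})
combined = combine_title_and_text(toy)
print(combined["text"].iloc[0])
```

Applied to the notebook's data, this would be `trainstr = combine_title_and_text(trainstr)` before selecting the output columns.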

#### Now write out only the label and text columns, because that is the input Comprehend expects.

In [8]:
dfout = trainstr[['label', 'text']]

In [9]:
dfout

Unnamed: 0,label,text
68388,WORLD,"PATTANI, Thailand, December 5 (IslamOnline.net..."
98155,SCI_TECH,America Online and WebEx Communications are ev...
100520,BUSINESS,Motorists coasted through Pennsylvania Turnpik...
119795,SPORTS,It no longer matters that Montana #39;s drive ...
63040,WORLD,"For years, Israel has feuded with the United N..."
...,...,...
85369,SPORTS,CBS and Fox yesterday announced an \$8 million...
16033,SPORTS,"AP - The hits and runs kept coming, spinning b..."
77218,WORLD,AFP - Seven artists from the new eastern membe...
43924,BUSINESS,A smart investor pays \$1.25 billion for Orbit...


#### Let's take a quick look at the label counts. The classes are roughly balanced.

In [10]:
dfout['label'].value_counts()

BUSINESS    264
SPORTS      262
WORLD       249
SCI_TECH    225
Name: label, dtype: int64

### Copy the data to an S3 bucket


In [11]:
# Get the account ID from STS so we can all have unique bucket names
client = boto3.client("sts")
account_id = client.get_caller_identity()["Account"]
bucket_name = "comprehend-labs" + account_id + "-2"
print("Bucket name used is " + bucket_name)

Bucket name used is comprehend-labs348052051973-2




In [12]:
s3 = boto3.resource('s3')
s3_client = boto3.client('s3')

if s3.Bucket(bucket_name).creation_date is None:
    # creation_date is None when the bucket is not in this account yet, so create it
    # location = {'LocationConstraint': region_name}
    s3_client.create_bucket(Bucket=bucket_name)  # , CreateBucketConfiguration=location)
    print("Created bucket " + bucket_name)
else:
    print("Bucket Exists")
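
The commented-out `LocationConstraint` matters when the notebook runs outside `us-east-1`: S3 generally expects no location constraint for `us-east-1` and an explicit one for other regions. A sketch of a helper (the name `create_bucket_kwargs` is hypothetical) that builds the arguments for `create_bucket` accordingly:

```python
def create_bucket_kwargs(bucket_name, region_name):
    """Build keyword arguments for s3_client.create_bucket for a given region."""
    kwargs = {"Bucket": bucket_name}
    if region_name != "us-east-1":
        # Regions other than us-east-1 need an explicit location constraint.
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region_name}
    return kwargs


# Usage with the client from the cell above:
# s3_client.create_bucket(**create_bucket_kwargs(bucket_name, region_name))
```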

Created bucket comprehend-labs348052051973-2


In [13]:
file_name="s3://" + bucket_name + "/custom_news_classification.csv"

In [14]:
# Write label,text rows with no header and no index column
dfout.to_csv(file_name, header=False, index=False)
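
Writing to an `s3://` path with `to_csv` relies on the `s3fs` package installed at the top of the notebook. To preview exactly what will be uploaded, the same frame can be serialized to a string first. This is a sketch with a toy frame; `to_comprehend_csv` is a hypothetical helper.

```python
import pandas as pd


def to_comprehend_csv(df):
    """Serialize label,text rows with no header and no index column."""
    return df[["label", "text"]].to_csv(header=False, index=False)


# Toy rows with made-up content, standing in for dfout.
toy = pd.DataFrame({
    "label": ["WORLD", "SPORTS"],
    "text": ["Hello, world", "Game on"],
})
csv_text = to_comprehend_csv(toy)
print(csv_text)
```

Fields containing commas are quoted automatically, so each row still has exactly two columns: the label first, then the document text.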

### Copy the S3 URI below into Comprehend as the training data location for a custom classifier!

In [15]:
print(file_name)

s3://comprehend-labs348052051973-2/custom_news_classification.csv
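
The URI can also be passed to Comprehend programmatically instead of through the console. The sketch below only builds the request for boto3's `create_document_classifier`; the classifier name, bucket and IAM role ARN are placeholders, and the role must be one that allows Comprehend to read the bucket.

```python
def classifier_request(name, s3_uri, role_arn, language_code="en"):
    """Build keyword arguments for Comprehend's create_document_classifier call."""
    return {
        "DocumentClassifierName": name,
        "DataAccessRoleArn": role_arn,
        "InputDataConfig": {"S3Uri": s3_uri},
        "LanguageCode": language_code,
        "Mode": "MULTI_CLASS",  # one label per document, as in this dataset
    }


request = classifier_request(
    name="news-classifier-demo",  # hypothetical name
    s3_uri="s3://example-bucket/custom_news_classification.csv",  # placeholder
    role_arn="arn:aws:iam::123456789012:role/ExampleComprehendRole",  # placeholder
)

# With credentials configured, the training job would be started with:
# import boto3
# boto3.client("comprehend").create_document_classifier(**request)
```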
