# HW01: Intro to Text Data

In this assignment, we will explore how to load a text classification dataset (AG's news, originally posted [here](http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html)), how we can preprocess the data and extract useful information from a real-world dataset. First, we have to download the data; we only download a subset of the data with four classes.

In [None]:
!curl -O https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv

# used curl instead of original wget because it didn't work and chatgpt said to try this

## Inspect Data

In [1]:
import pandas as pd
df = pd.read_csv("train.csv", header=None)
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120000 entries, 0 to 119999
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   0       120000 non-null  int64 
 1   1       120000 non-null  object
 2   2       120000 non-null  object
dtypes: int64(1), object(2)
memory usage: 2.7+ MB


Unnamed: 0,0,1,2
0,3,Wall St. Bears Claw Back Into the Black (Reuters),"Reuters - Short-sellers, Wall Street's dwindli..."
1,3,Carlyle Looks Toward Commercial Aerospace (Reu...,Reuters - Private investment firm Carlyle Grou...
2,3,Oil and Economy Cloud Stocks' Outlook (Reuters),Reuters - Soaring crude prices plus worries\ab...
3,3,Iraq Halts Oil Exports from Main Southern Pipe...,Reuters - Authorities have halted oil export\f...
4,3,"Oil prices soar to all-time record, posing new...","AFP - Tearaway world oil prices, toppling reco..."


Let's make the data more human readable by adding a header and replacing labels

In [2]:
df.columns = ["label", "title", "lead"]
label_map = {1:"world", 2:"sport", 3:"business", 4:"sci/tech"}
def replace_label(x):
	return label_map[x]
df["label"] = df["label"].apply(replace_label) 

In [3]:
df.head()

Unnamed: 0,label,title,lead
0,business,Wall St. Bears Claw Back Into the Black (Reuters),"Reuters - Short-sellers, Wall Street's dwindli..."
1,business,Carlyle Looks Toward Commercial Aerospace (Reu...,Reuters - Private investment firm Carlyle Grou...
2,business,Oil and Economy Cloud Stocks' Outlook (Reuters),Reuters - Soaring crude prices plus worries\ab...
3,business,Iraq Halts Oil Exports from Main Southern Pipe...,Reuters - Authorities have halted oil export\f...
4,business,"Oil prices soar to all-time record, posing new...","AFP - Tearaway world oil prices, toppling reco..."


In [3]:
# TODO implement a new column text which contains the lowercased title and lead (concatenated with space)
df["text"] = df["title"].str.lower() + " " + df["lead"].str.lower()

In [4]:
# TODO print the number of documents for each label
from collections import Counter

# Counts the frequency of each unique item in some iterable
Counter(df["label"])


Counter({'business': 30000, 'sci/tech': 30000, 'sport': 30000, 'world': 30000})

## Document Length

In [None]:
# TODO create a new column with the number of non-stop words in each text
# TODO plot the average number of non-stop words per label 

In [5]:
# I cannot think of another way to do it, so I will install nltk and use a stopword list from there
import nltk
# nltk.download('stopwords')
# nltk.download("punkt")

In [6]:
# First create a new column with stopword count
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

sw_list = set(stopwords.words("english"))

# create new column for counting non-stopwords
nsw_occurences = list()

for index, row in df.iterrows():
    tokens = word_tokenize(row["text"])
    stops = sum(1 for word in tokens if word in sw_list)
    nsw_occurences.append((len(tokens) - stops))

df.insert(len(df.columns), "non-stopword counts", nsw_occurences, True)


In [9]:
# Now calculate the average number of non-stopwords per label
df.groupby('label')['non-stopword counts'].mean()

label
business    33.409867
sci/tech    31.777267
sport       30.611833
world       31.975767
Name: non-stopword counts, dtype: float64

## Word Frequency 

Let's implement a keyword search (similar to the baker-bloom economic uncertainty) and compute how often some given keywords ("play", "tax", "blackberry", "israel") and numbers appear in the different classes in our data

In [22]:
import re
keywords = ["play", "tax", "blackberry", "israel"]
for keyword in keywords:
    #TODO implement a regex pattern for keyword
    x = "\b{}\b".format(keyword)
    pattern = re.compile(x)
    def count_keyword_frequencies(x):
        #TODO implement a function which counts how often a pattern appears in a text
        num_occurrences = len(re.findall(pattern, x))
        return num_occurrences
    # Now, we can print how often a keyword appears in the data
    print (df["text"].apply(count_keyword_frequencies).sum())
    # and we want to find out how often the keyword appears withhin each class
    for label in df["label"].unique():
        print ("label:", label,", keyword:", keyword)
        #TODO print how often the keyword appears in this class
        print(df.loc[df['label'] == label]['text'].apply(count_keyword_frequencies).sum())
    print ("*" * 100)

0
label: business , keyword: play
0
label: sci/tech , keyword: play
0
label: sport , keyword: play
0
label: world , keyword: play
0
****************************************************************************************************
0
label: business , keyword: tax
0
label: sci/tech , keyword: tax
0
label: sport , keyword: tax
0
label: world , keyword: tax
0
****************************************************************************************************
0
label: business , keyword: blackberry
0
label: sci/tech , keyword: blackberry
0
label: sport , keyword: blackberry
0
label: world , keyword: blackberry
0
****************************************************************************************************
0
label: business , keyword: israel
0
label: sci/tech , keyword: israel
0
label: sport , keyword: israel
0
label: world , keyword: israel
0
****************************************************************************************************


As a last exercise, we re-use the fuzzy keyword search implemented above and plot the total number of occurrences of "tax" (and it's variations, e.g. taxation, taxes etc.) for each class in the dataset. Hint: have a look at the [pandas bar plot with group by](https://queirozf.com/entries/pandas-dataframe-plot-examples-with-matplotlib-pyplot)

In [None]:
import matplotlib.pyplot as plt

keyword = "tax"
pattern = re.compile(...)

def count_keyword_frequencies(x):
    #TODO implement a function which counts the total number of the word "tax" (and other fuzzy matches of tax) appearing in a given text

df["counts"] = df["text"].apply(count_keyword_frequencies)
#TODO create a bar plot for the wordcounts of "tax" for each class in the dataset

In [20]:
import os

os.system('jupyter nbconvert --to html homework_01.ipynb')

0

In [21]:
!open homework_01.html

'open' is not recognized as an internal or external command,


operable program or batch file.
