## Inspect each chunk for accuracy of "crime" label

#### Import libraries

In [1]:
import os
import pandas as pd
pd.set_option('display.max_colwidth', 150)

#### Manual Inspection of Topic Model Accuracy
- Load each file saved to `data/interim` during chunk processing (`03_TopicModelingAllBatches.ipynb`)
- Skim the "title" and "article" columns to check whether the labeled news is really crime-related
    - Read the "title" and "article" of the labeled news in each file in the `data/interim` folder
    - Record the filenames that look non-relevant for the next step

In [2]:
os.getcwd()

'/Users/jhonsen/Documents/DS/dsProjects/racial-bias-detection/notebooks'

In [3]:
filepath = os.path.join('..','data', 'interim', 'crime_topic_index.gzip')
topic_index = pd.read_parquet(filepath)
topic_index.head(3)

| | filename | topic | start_row | end_row |
|---|---|---|---|---|
| 0 | labeled_crime_row1_to_row20000.gzip | 12 | 1 | 20000 |
| 1 | labeled_crime_row20001_to_row40000.gzip | 5 | 20001 | 40000 |
| 2 | labeled_crime_row40001_to_row60000.gzip | 13 | 40001 | 60000 |


In [4]:
# Total number of files to inspect
topic_index.shape[0]

135

In [298]:
total = topic_index.shape[0]

def view(start, end):
    """Display the first few labeled articles for topic_index rows start..end-1."""
    display_row = 3  # show only the first 3 matching rows per file
    colnames = ['topic', 'title', 'article']
    # A list indexer raises IndexError when `end` runs past the last row
    selected_chunk = topic_index.iloc[[*range(start, end)], :]

    for _, row in selected_chunk.iterrows():
        if row['topic'] == 99:  # 99 is a dummy topic, nothing to inspect
            continue

        # Rebuild the chunk filename from its row range
        start_row, end_row, topic = row['start_row'], row['end_row'], row['topic']
        filename = f'labeled_crime_row{start_row}_to_row{end_row}.gzip'

        # Load the chunk and keep the first rows assigned to the crime topic
        article = (pd.read_parquet(os.path.join('..', 'data', 'interim', filename), engine="pyarrow")
                     .query(f'topic == {topic}')
                     .head(display_row)[colnames])
        select_topic_index = topic_index[(topic_index['start_row'] == start_row)
                                         & (topic_index['end_row'] == end_row)][['filename', 'topic']]

        display(pd.merge(select_topic_index, article, how='inner', on='topic'))

def data_batch(total):
    """Show one batch of 4 files per next() call."""
    for start in range(0, total, 4):
        end = start + 4
        try:
            print(start, end)
            yield view(start, end)
        except IndexError:
            # Final partial batch: displayed here but not yielded,
            # so this next() call ends with StopIteration
            end = total
            view(start, end)

# Initialize
go = data_batch(total)

During inspection, record the filenames of chunks whose news does not appear to be about crime or violence.

In [297]:
# Show the next batch of 4 files (135 files in total); re-run until StopIteration
next(go)

132 136


StopIteration: 
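
The `StopIteration` above follows from how the last batch is handled: the index range 132 to 136 overruns the 135 rows, so the batch is shown from the `except` branch and the generator then ends. A minimal sketch of an alternative that clamps the final batch instead is given below. `batch_bounds` is a hypothetical helper, not part of the notebook, and it only computes the index pairs that would be passed to `view`.

```python
def batch_bounds(total, size=4):
    """Yield (start, end) pairs that cover range(total) in steps of `size`."""
    for start in range(0, total, size):
        # Clamp the last batch so `end` never exceeds `total`
        yield start, min(start + size, total)


bounds = list(batch_bounds(135))
print(len(bounds))   # 34 batches for 135 files
print(bounds[-1])    # (132, 135), the final shorter batch
```

With bounds like these, every batch, including the last, could be yielded the same way, so the exception would only signal that no files remain.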

In [135]:
# List the filenames whose titles/articles don't appear to be crime-related

non_crime_filenames = [
                       'labeled_crime_row1520001_to_row1540000.gzip',
                       'labeled_crime_row2460001_to_row2480000.gzip',
                      ]
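
Because these filenames are typed in by hand, a typo would silently leave a file in the "crime" set. The sketch below shows one way to sanity-check each entry against the naming pattern seen in `topic_index`. `parse_chunk_name` and the regular expression are assumptions made for illustration and are not part of the notebook.

```python
import re

# Pattern inferred from names such as 'labeled_crime_row1_to_row20000.gzip'
CHUNK_NAME = re.compile(r"labeled_crime_row(\d+)_to_row(\d+)\.gzip")


def parse_chunk_name(filename):
    """Return (start_row, end_row) or raise ValueError for a malformed name."""
    match = CHUNK_NAME.fullmatch(filename)
    if match is None:
        raise ValueError(f"unexpected chunk filename: {filename!r}")
    start_row, end_row = int(match.group(1)), int(match.group(2))
    if start_row > end_row:
        raise ValueError(f"start row after end row in {filename!r}")
    return start_row, end_row


flagged = [
    'labeled_crime_row1520001_to_row1540000.gzip',
    'labeled_crime_row2460001_to_row2480000.gzip',
]
for name in flagged:
    print(name, parse_chunk_name(name))
```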

---

##### Inspect relevant (positive) crime news

In [155]:
files = os.listdir(os.path.join('..','data', 'interim'))
crime_files = [f for f in files if f not in non_crime_filenames and 'labeled_crime_row' in f]
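
`os.listdir` returns entries in arbitrary order, so which file ends up at `crime_files[0]` can differ between machines. The sketch below shows the same filter with a sorted result. `select_crime_files` and the sample listing are hypothetical, and the sort is lexicographic rather than by row number.

```python
def select_crime_files(files, excluded):
    """Keep labeled chunk files that were not flagged, in a stable order."""
    excluded = set(excluded)  # set membership is cheaper than a list scan
    return sorted(f for f in files
                  if 'labeled_crime_row' in f and f not in excluded)


# Hypothetical directory listing, for illustration only
listing = [
    'crime_topic_index.gzip',
    'labeled_crime_row20001_to_row40000.gzip',
    'labeled_crime_row1_to_row20000.gzip',
    'labeled_crime_row1520001_to_row1540000.gzip',
]
flagged = ['labeled_crime_row1520001_to_row1540000.gzip']
print(select_crime_files(listing, flagged))
```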


In [156]:
fname = os.path.join('..','data', 'interim', crime_files[0])
df = pd.read_parquet(fname, engine="pyarrow")
display(df[['title','article']].head(3))
display(df[['title','article']].sample(3))

| | title | article |
|---|---|---|
| 600002 | Lake Bell Opens Up About How Husband Scott Campbell Changed Her View of Love &amp; Marriage | I Do…Until I Don’t Is Currently In Theaters Come back every day at 8:30 a.m. EST to watch People Now streaming live from Time Inc. headquarters in... |
| 600015 | How Top Gun Star Kelly McGillis Survived Sexual Assaults | Happy Birthday, Kelly McGillis! The Top Gun actress turned 61 on Monday. Her birthday arrives as Tom Cruise continues to shoot the long-anticipat... |
| 600032 | How 2 'Heroes' Helped Police Rescue Abducted Texas Girl, 8 | When Jeff King heard that his old friend’s 8-year-old daughter had been grabbed on the street while walking with her mother Saturday night, he jum... |


| | title | article |
|---|---|---|
| 612883 | Webcam slavery: tech turns Filipino families into cybersex child traffickers | MANILA (Thomson Reuters Foundation) - It was the half-naked girls running from room to room upon her arrival that made Filipina teenager Ruby fear... |
| 619474 | Father Says He Expects Teens to Go Out in ‘Blaze of Glory’ as They Are Charged With Third Murder | This article originally appeared on VICE Canada. Police have identified the third person allegedly murdered by two teens on the run as a universi... |
| 603873 | 6-Year-Old Girl Dies After Younger Brother Shoots Her Accidentally | A 6-year-old girl is dead after her younger brother accidentally shot her in the head with a loaded handgun he found in the console of the family’... |
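
The first title in this file still carries an HTML entity (`&amp;`), which makes skimming slightly harder. If that matters during inspection, the text could be decoded before display. The sketch below uses the standard library's `html.unescape`. `clean_title` is a hypothetical helper, and applying it to the stored data is an assumption, not something the notebook does.

```python
import html


def clean_title(title):
    """Decode HTML entities such as '&amp;' in a scraped title."""
    return html.unescape(title)


raw = "Lake Bell Opens Up About How Husband Scott Campbell Changed Her View of Love &amp; Marriage"
print(clean_title(raw))
```

In a DataFrame this could be applied with something like `df['title'].map(clean_title)` before calling `display`.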


##### Inspect news that may not be crime- or violence-related

In [148]:
df = pd.read_parquet(os.path.join('..','data', 'interim', non_crime_filenames[1]), engine="pyarrow")
display(df[['title','article']].head(3))
display(df[['title','article']].sample(3))

| | title | article |
|---|---|---|
| 2460338 | Rep. Duncan Hunter resigns from Congress | Rep. Duncan Hunter will officially step down from Congress next week, more than a month after the California Republican pleaded guilty to conspira... |
| 2460366 | Ruth Bader Ginsburg says she is cancer-free | Ruth Bader Ginsburg remains clear of cancer, the Supreme Court justice told CNN this week. “I’m cancer free. That’s good,” she told the outlet in ... |
| 2460370 | Appeals court lifts block on $3.6 billion for Trump border wall plan | A divided federal appeals court has lifted a lower court’s order blocking $3.6 billion in military construction funds that President Donald Trump ... |


| | title | article |
|---|---|---|
| 2467522 | U.S. FAA seeks to fine Boeing $5.4 million for defective parts on 737 MAX planes | WASHINGTON, Jan 10 (Reuters) - The Federal Aviation Administration (FAA) said on Friday it is seeking to fine Boeing Co $5.4 million, alleging it ... |
| 2477020 | Weinstein judge warns jurors against #MeToo 'referendum' | Judge James Burke told prospective jurors on Thursday not to view Harvey Weinstein&aposs criminal trial as a "referendum on the #MeToo movement" o... |
| 2478642 | How gangs 'effectively control' the prisons in Honduras | Editor&aposs Note: This article is part of an ongoing series on prison conditions and criminal justice policy around the world.Dozens of inmates w... |
