# Objective

Check whether any rows with age > 40 and date = 2015-02-02 have IDs that match the IDs available in the JSONL files, so that the extracted age can be checked against the source text.

# Takeaways

In the first million lines of the TSV file, none of the 13 rows with age > 40 and date = 2015-02-02 had an ID matching any of the 146,843 IDs available in the readable JSONL files (0001.jsonl, 0002.jsonl, and 0003.jsonl).

My computer was unable to unzip 0000.jsonl.gz.

Also, after inspecting the extracted_text field, I did not see much additional information left to extract.

# Next steps

Try to read 0000.jsonl contents another way.
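One option for the step above is to skip the unzip step and stream the compressed file directly. The following is a minimal sketch, assuming 0000.jsonl.gz is a regular gzip archive with one JSON object per line and a doc_id field like the other files; the helper names `iter_jsonl_gz` and `collect_doc_ids` are hypothetical. If the archive is truncated or corrupt, the sketch keeps whatever IDs it managed to read before the error.

```python
import gzip
import json
import zlib


def iter_jsonl_gz(path):
    """Yield one parsed record per non-empty line of a gzip-compressed JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def collect_doc_ids(path):
    """Collect doc_id values, stopping early (without raising) on a damaged archive."""
    ids = set()
    try:
        for record in iter_jsonl_gz(path):
            ids.add(record["doc_id"])
    except (EOFError, gzip.BadGzipFile, zlib.error, json.JSONDecodeError) as err:
        # A truncated or corrupt file ends up here; the IDs read so far are kept.
        print(f"Stopped early on {path}: {err!r}")
    return ids


# Hypothetical usage:
# ids_0000 = collect_doc_ids("0000.jsonl.gz")
```

Streaming line by line also avoids holding the whole decompressed file in memory, which may matter if the unzip failure was related to file size.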

Read more than one million lines of the tsv file.
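Reading past the first million lines does not require loading the whole file at once. The sketch below, assuming the TSV has a date column as used in this notebook, reads the file in chunks with the `chunksize` argument of `pd.read_csv` and keeps only the rows for one date. The function name `rows_for_date` and the default chunk size are hypothetical choices.

```python
import pandas as pd


def rows_for_date(tsv_path, date, chunksize=250_000):
    """Scan a large TSV chunk by chunk and return only the rows for one date."""
    matches = []
    for chunk in pd.read_csv(tsv_path, sep="\t", chunksize=chunksize):
        # Keep just the matching rows so memory use stays close to one chunk.
        matches.append(chunk.loc[chunk["date"] == date])
    if not matches:
        return pd.DataFrame()
    return pd.concat(matches)


# Hypothetical usage:
# df_2015_02_02 = rows_for_date("extractions_02_19_2020.tsv", "2015-02-02")
```

Only the filtered rows are retained, so peak memory should depend on the chunk size rather than on the total number of lines.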


In [2]:
import csv, re, jsonlines
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from collections import Counter

# Exploring the JSONL files

In [194]:
fname = '0001.jsonl'
with jsonlines.open(fname) as f:
    data = [line for line in f.iter()]
df = pd.DataFrame(data)

In [72]:
df['extracted_text'][0]

'\n\n\n\n\n\n\nLive Escort Reviews - 415-745-4689 - .Sexy BLONDE!!(( Your PlEASURE Is My PURPOSE)) - 19\n\n\n \n\n\n\n \n\n\n\n\n\n\n\n\n\n Login /\n Register\n\n\n\n\nSacramento, CA\n\n\n\n\n  \n\n\n\nEscort Ads\n |\n\nBody Rub Ads\n |\n\nCam Models\n |\n\nFilter Fakes\n |\n\nReviewed Ads\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n\n\nPin Ad to Gallery\n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n\n.Sexy BLONDE!!(( Your PlEASURE Is My PURPOSE)) - 19\n\n\n Sacramento East, Yuba City/Marysville | Friday, January 2, 2015 10:44 AM | •\n415-745-4689\n\n\n \n\n\n\n\nNo TER review found\n\n\n\n\nWrite a review\n\n\n \n\n\n\n\nMore ads and images with this phone number:\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n \n\n\n\n\nBackpage Link\n\n\n\n\n\nBack to Gallery\n\n\n\n[ Report Ad ]\n\n\n\n  \n\n  \n\n\n\n  \n\n\nAll Cities |\nEscort Ads |\nReviewed Ads |\nContact |\nAbout\n\n\n\nCopyright @ 2014 LiveEscortReviews.com\n\n\n\n  \n \n\n \n\n\n\n\n\n\n\n'

In [73]:
list(df['doc_id'])[0]

'3BC13826C2F861DF36F03B4B58D290BEC9FBB1DE2CB710D2505283D0D1EBF2EC'

# Investigating ethnicity and age in the TSV

Just looking at the first million entries for now (since that's about as much as my MacBook Pro can handle locally).

In [165]:
df_tsv = pd.read_csv("extractions_02_19_2020.tsv", 
                 sep='\t', 
                 nrows=1000000)

In [166]:
df_2015_02_02 = df_tsv.loc[df_tsv['date'] == '2015-02-02']

In [167]:
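ages_all = []
for entry in list(df_2015_02_02['age']):
    if isinstance(entry, str):
        # Drop the two wrapper characters on each side of the value before converting.
        ages_all.append(int(entry[2:-2]))
    else:
        # Non-string entries (e.g. NaN for a missing age) are recorded as 0.
        ages_all.append(0)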
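
The fixed slice above depends on every age cell having exactly two wrapper characters on each side. As a more tolerant alternative, here is a minimal sketch that pulls the first run of digits out of the cell with a regular expression. The exact serialization of the age column is not shown in this notebook, so the sample strings are hypothetical, and `parse_age` is an illustrative helper rather than part of the original analysis.

```python
import re


def parse_age(entry):
    """Return the first integer found in a raw age cell, or 0 if there is none."""
    if not isinstance(entry, str):
        return 0  # e.g. NaN for a missing value
    match = re.search(r"\d+", entry)
    return int(match.group()) if match else 0


# Hypothetical cell formats, for illustration only.
print(parse_age("['42']"))
print(parse_age("[]"))
```

Note that this differs from the slice when a cell holds several values: the slice would raise an error, while this sketch silently takes the first one.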

In [168]:
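ids_all = []
# The column is looked up as 'id ' (with a trailing space).
for entry in list(df_2015_02_02['id ']):
    # Drop the last character of each raw ID.
    ids_all.append(entry[:-1])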

In [169]:
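confidences_all = []
for entry in list(df_2015_02_02['location-fonduer']):
    # The confidence digit is read from the second-to-last character of the
    # 'location-fonduer' cell; short or non-string cells are recorded as 0.
    if isinstance(entry, str) and len(entry) > 2:
        confidences_all.append(int(entry[-2]))
    else:
        confidences_all.append(0)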

In [170]:
df_clean = pd.DataFrame(
    {'ID': ids_all,
     'Age': ages_all,
     'Confidence': confidences_all})

In [179]:
df_older = df_clean[df_clean.Age > 40]

In [180]:
len(df_older)

13

In [181]:
df_older

| | ID | Age | Confidence |
|---|---|---|---|
| 9 | A86AF4CA0468A1A2B677A4882B610783A390E48A16BC08... | 42 | 0 |
| 41 | 6AC58620DF74663409876DDA81D499762037504290E840... | 69 | 0 |
| 159 | C3FA32DEC1E6FD2ADA72AE806F20727DC33781D77BFF01... | 44 | 1 |
| 218 | 55AD5E19DF167FA7D83144686BA51C6170DAAE0DF37031... | 44 | 1 |
| 228 | FAB1ADB96A3E582498C5EE07825B1223BC4EDCE84B18D1... | 49 | 0 |
| 242 | 7129D9FA5FFFC3232210C1F4FA3F45259B2FA01D0C369D... | 98 | 1 |
| 265 | 93DDD3754AADEBBF1C7164F12268943749EE77E84880E6... | 44 | 1 |
| 344 | DFA674E338C92214F8C030A20FF9811B1765A3903E3528... | 43 | 1 |
| 414 | FA6958AC53BF92802817F3E2474754553BDECB17A1AD40... | 99 | 1 |
| 421 | B5E5649779E577EFF7BFC8B659E167B731413672D42403... | 47 | 0 |


In [199]:
# Collect every doc_id from the JSONL files that could be read.
available_ids = []
fnames = ['0001.jsonl', '0002.jsonl', '0003.jsonl']
for fname in fnames:
    with jsonlines.open(fname) as f:
        available_ids += [record['doc_id'] for record in f]

In [198]:
len(available_ids)

146843


In [182]:
# Check each older row's ID against the IDs collected from the JSONL files.
for ind in df_older.index:
    print(df_older.loc[ind, 'ID'] in available_ids)

False
False
False
False
False
False
False
False
False
False
False
False
False
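
The row-by-row check above scans a list of roughly 147,000 IDs once per row. This is fine for 13 rows, but it may become slow once more of the TSV is read. A minimal sketch of a vectorized version, using a set and `Series.isin`, is shown below with toy data; the helper name `flag_matches` and the sample IDs are hypothetical.

```python
import pandas as pd


def flag_matches(df, id_column, available_ids):
    """Return a boolean Series marking rows whose ID appears in available_ids."""
    id_set = set(available_ids)  # set membership avoids rescanning a long list
    return df[id_column].isin(id_set)


# Toy data standing in for df_older and available_ids.
toy = pd.DataFrame({"ID": ["AAA", "BBB", "CCC"], "Age": [42, 69, 44]})
mask = flag_matches(toy, "ID", ["BBB", "ZZZ"])
print(mask.sum())   # number of matching rows
print(toy[mask])    # the matching rows themselves
```

Summing the mask gives the number of matches directly, which is easier to read than a column of printed booleans.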

