# Label and Clean Records

Records are labeled manually according to the following categories:
 * `0` - default value
 * `1` - related to the autonomous cars topic
 * `2` - not related to the autonomous cars topic
 * `3` - error message or empty record
 * `4` - duplicates
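
As a small aid for the cells that follow, the label codes above can be given names. This is only a sketch: the enum `RecordLabel` and its member names are hypothetical and do not appear in the original notebook, which uses the bare integers.

```python
from enum import IntEnum


class RecordLabel(IntEnum):
    """Hypothetical names for the label codes listed above."""
    DEFAULT = 0         # not yet labeled
    RELATED = 1         # related to the autonomous cars topic
    NOT_RELATED = 2     # not related to the autonomous cars topic
    ERROR_OR_EMPTY = 3  # error message or empty record
    DUPLICATE = 4       # duplicate of an earlier record


# IntEnum members compare equal to plain integers, so they can be
# used wherever the notebook writes 3 or 4 directly.
print(RecordLabel.DUPLICATE == 4)
print(RecordLabel(3).name)
```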

**Import libraries:**

In [1]:
import pandas as pd


**Read in the file with manually labeled records (note: only a subsample of the first 800 records is labeled; the others have the default value of `0`):**

In [2]:
df = pd.read_csv("../data/processed/20191209233601.19044.gkg.Labeled.txt", sep='\t', index_col=0)

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4089 entries, 0 to 4088
Data columns (total 13 columns):
Date             4089 non-null int64
NumArticles      4089 non-null int64
Counts           586 non-null object
Themes           4001 non-null object
Locations        4089 non-null object
Persons          3578 non-null object
Organizations    3771 non-null object
ToneData         4089 non-null object
CAMEOEvents      2222 non-null object
Sources          4089 non-null object
SourceURLs       4089 non-null object
text             3651 non-null object
label            4089 non-null int64
dtypes: int64(3), object(10)
memory usage: 607.2+ KB


**Find all dummy records from `google1299.com` and replace them with empty text:**

In [31]:
df[ df['text'].astype(str).str.contains('google1299.com', regex=False) ].shape

(131, 13)

In [32]:
df.loc[ df['text'].astype(str).str.contains('google1299.com', regex=False), 'text'] = ''

**Find all records that didn't go through the proxy server and replace them with empty text:**

In [39]:
df[ df['text'].astype(str).str.contains('edc.nam.gm.com', regex=False) ].shape

(42, 13)

In [40]:
df.loc[ df['text'].astype(str).str.contains('edc.nam.gm.com', regex=False), 'text'] = ''

**Find all records with `nan` content and replace them with empty text:**

In [44]:
df[ df['text'].astype(str) == 'nan' ].shape

(438, 13)

In [45]:
df.loc[ (df['text'].astype(str) == 'nan'), 'text'] = ''
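
The three blanking steps above (dummy-site records, proxy failures, and `nan` content) could also be folded into one helper. The following is a minimal sketch on a toy frame, not the notebook's own code: the names `blank_bad_text`, `BAD_MARKERS`, and `toy` are hypothetical. It passes `regex=False` so that the dots in the domain names are matched literally rather than as "any character".

```python
import pandas as pd

# Marker strings taken from the cells above.
BAD_MARKERS = ["google1299.com", "edc.nam.gm.com"]


def blank_bad_text(frame, markers=BAD_MARKERS):
    """Return a copy where missing or marker-containing text is set to ''."""
    out = frame.copy()
    # Missing values become empty strings; everything else is cast to str.
    text = out["text"].fillna("").astype(str)
    bad = pd.Series(False, index=out.index)
    for marker in markers:
        # regex=False: treat the marker as a literal substring.
        bad |= text.str.contains(marker, regex=False)
    out["text"] = text.mask(bad, "")
    return out


# Toy data (made up for illustration only).
toy = pd.DataFrame({"text": [
    "real article",
    "see google1299.com",
    None,
    "proxy edc.nam.gm.com error",
    "google1299xcom",  # no literal dot, so it is kept
]})
cleaned = blank_bad_text(toy)
print(cleaned["text"].tolist())
```

Working on a copy leaves the input frame untouched, which makes it easier to compare record counts before and after cleaning.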

**Find all empty records and label them as `3`:**

In [46]:
df[ df['text'].astype(str) == '' ].shape

(611, 13)

In [47]:
df.loc[ df['text'].astype(str) == '', 'label'] = 3

**Find all duplicates among non-empty records and label them with `4`:**

In [49]:
df[ df['text'].astype(str) != '' ].shape

(3478, 13)

In [63]:
df['text'].duplicated(keep='first').shape

(4089,)

In [75]:
df[ df['text'].duplicated(keep='first') ].shape

(1263, 13)

In [77]:
df[ ((df['text'].astype(str) != '') & (df['text'].duplicated(keep='first'))) ].shape

(653, 13)

In [64]:
df.loc[ ((df['text'].astype(str) != '') & (df['text'].duplicated(keep='first'))), 'label'] = 4
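
The two labeling rules (empty text gets `3`, repeated non-empty text gets `4`) can be checked on a small example. This is a sketch under the assumption that the `text` column has already been cleaned to plain strings; the function name `label_empty_and_duplicates` and the toy frame are hypothetical.

```python
import pandas as pd


def label_empty_and_duplicates(frame):
    """Label empty text as 3 and non-empty repeats of earlier text as 4."""
    out = frame.copy()
    is_empty = out["text"] == ""
    # keep='first' marks every occurrence after the first one as a duplicate.
    is_dup = out["text"].duplicated(keep="first")
    out.loc[is_empty, "label"] = 3
    # Empty records also duplicate each other, so exclude them here.
    out.loc[~is_empty & is_dup, "label"] = 4
    return out


# Toy data (made up for illustration only).
toy = pd.DataFrame({
    "text": ["a", "", "a", "", "b"],
    "label": [1, 0, 0, 0, 2],
})
labeled = label_empty_and_duplicates(toy)
print(labeled)
```

Note that the first occurrence of a repeated text keeps its existing label; only the later copies are marked as duplicates.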

In [78]:
df[ ((df['text'].astype(str) != '') & (df['text'].duplicated(keep='first'))) ][['text','label']]

| | text | label |
|---|---|---|
| 81 | Let friends in your social network know what y... | 4 |
| 98 | BEIJING - In the week prior to the Chinese Lun... | 4 |
| 113 | The joint venture comprises Toyota and two of ... | 4 |
| 141 | NewsFactor \| CIO Today \| Top Tech News \| Sci-T... | 4 |
| 148 | Copyright ©2019 ACN Newswire. All rights reser... | 4 |
| ... | ... | ... |
| 4073 | Arizona Gov. Doug Ducey has sent Uber Technolo... | 4 |
| 4074 | Let friends in your social network know what y... | 4 |
| 4075 | Automakers will only be responsible in the cas... | 4 |
| 4076 | The page you were looking for may have been ou... | 4 |


**Finally, let's count records for each label:**

In [79]:
df[ df['label'] == 0 ].shape

(2251, 13)

In [80]:
df[ df['label'] == 1 ].shape

(288, 13)

In [81]:
df[ df['label'] == 2 ].shape

(129, 13)

In [82]:
df[ df['label'] == 3 ].shape

(659, 13)

In [83]:
df[ df['label'] == 4 ].shape

(762, 13)
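
The per-label counts can also be obtained in a single call with `value_counts`. The sketch below uses a made-up toy frame rather than the real data; `reindex` with `fill_value=0` is used so that labels with no records still show up with a count of zero.

```python
import pandas as pd

# Toy data (made up for illustration only); label 2 is deliberately absent.
toy = pd.DataFrame({"label": [0, 0, 1, 3, 4, 4, 4]})

# Count records per label and list all five codes in order.
label_counts = toy["label"].value_counts().reindex(range(5), fill_value=0)
print(label_counts)
```

The counts should sum to the number of rows, which is a quick check that every record carries one of the five labels.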

**Save the DataFrame back to the same file:**

In [85]:
df.to_csv("../data/processed/20191209233601.19044.gkg.Labeled.txt", sep='\t')