# Loading data

Upload the competition's .tsv file to Google Drive, then list the folder's contents to confirm the file is there.

In [0]:
#Connecting to Google Drive for dataset access
!pip install -U -q PyDrive
 
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
 
# 1. Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

file_list = drive.ListFile({'q': "'1NmIiXgqRs1TTIn3wFZG2aa0QCwkr89y7' in parents and trashed=false"}).GetList()
for file1 in file_list:
  print('title: %s, id: %s' % (file1['title'], file1['id']))

title: gendered-pronoun-resolution.zip, id: 1m45CHcYuTinEBTwoJ_qAMZU1_Cx6V2q9
title: test_stage_1.tsv, id: 1Ggca0VY2PXezXX9srkbEBEH9f1EHQBSp
title: sample_submission_stage_1.csv, id: 1gWpfR9MtjQfl9kqCYUDUIMV5ZuIO60SP


In [0]:
dataset_name = "test_stage_1.tsv"

def get_data_set():
  # Ask for a file name only when none is configured above.
  if dataset_name == "":
    requested_dataset_name = input().lower()
  else:
    requested_dataset_name = dataset_name

  # Look up the Drive file id by title (case-insensitive).
  # Start from None so the "not found" check below works on both branches.
  requested_dataset_id = None
  for file in file_list:
    if file['title'].lower() == requested_dataset_name.lower():
      requested_dataset_id = file['id']
      break

  if requested_dataset_id is None:
    raise FileNotFoundError('No dataset with the name ' + requested_dataset_name + ' could be found. Are you in the correct folder?')

  # Download the file into the working directory, then parse it as TSV.
  data = drive.CreateFile({'id': requested_dataset_id})
  data.GetContentFile(requested_dataset_name)

  import pandas as po
  dataset = po.read_csv(requested_dataset_name, sep="\t")
  return dataset

dataset = get_data_set()

dataset.head(5)

Unnamed: 0,ID,Text,Pronoun,Pronoun-offset,A,A-offset,B,B-offset,URL
0,development-1,Zoe Telford -- played the police officer girlf...,her,274,Cheryl Cassidy,191,Pauline,207,http://en.wikipedia.org/wiki/List_of_Teachers_...
1,development-2,"He grew up in Evanston, Illinois the second ol...",His,284,MacKenzie,228,Bernard Leach,251,http://en.wikipedia.org/wiki/Warren_MacKenzie
2,development-3,"He had been reelected to Congress, but resigne...",his,265,Angeloz,173,De la Sota,246,http://en.wikipedia.org/wiki/Jos%C3%A9_Manuel_...
3,development-4,The current members of Crime have also perform...,his,321,Hell,174,Henry Rosenthal,336,http://en.wikipedia.org/wiki/Crime_(band)
4,development-5,Her Santa Fe Opera debut in 2005 was as Nuria ...,She,437,Kitty Oppenheimer,219,Rivera,294,http://en.wikipedia.org/wiki/Jessica_Rivera


# Data exploration

In [0]:
import pandas as po
import numpy as no

dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 9 columns):
ID                2000 non-null object
Text              2000 non-null object
Pronoun           2000 non-null object
Pronoun-offset    2000 non-null int64
A                 2000 non-null object
A-offset          2000 non-null int64
B                 2000 non-null object
B-offset          2000 non-null int64
URL               2000 non-null object
dtypes: int64(3), object(6)
memory usage: 140.7+ KB


### Unique values

Notes:


*  ID - all values are unique, as expected. Good. May want to convert to integer.
*  Text - 1,999 unique values in 2,000 rows, so one text appears twice. Further investigation is required.
*  Pronoun - has 9 unique values. Check for casing issues or trolls.
*  Pronoun-offset - what does this mean exactly? What is the connection between the value of the offset and correct identification of gender, if any? (Will find out during feature selection and model development.)
*  A - name of a person.
*  A-offset - not sure what this is either.
*  B - name of another person (in the same Text value).
*  B-offset - not sure what this is either.
*  URL - some rows share the same URL (1,834 unique values). Articles with shared authors may be a confounding factor.


In [0]:
for column in dataset.columns:
  print(column, len(dataset[column].unique()))

ID 2000
Text 1999
Pronoun 9
Pronoun-offset 444
A 1793
A-offset 450
B 1774
B-offset 480
URL 1834
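
Two of the counts above invite a follow-up: Text has one fewer unique value than there are rows, and Pronoun has 9 distinct values that may differ only by casing. The sketch below shows one way to surface both. It works on plain Python lists with made-up values standing in for the columns; for the real data one could pass `list(dataset["Text"])` or `list(dataset["Pronoun"])`. The helper names `duplicated_values` and `casing_variants` are hypothetical and not part of the notebook.

```python
from collections import Counter

def duplicated_values(values):
    """Map each value that occurs more than once to its number of occurrences."""
    counts = Counter(values)
    return {value: n for value, n in counts.items() if n > 1}

def casing_variants(values):
    """Group distinct values that are equal once lowercased."""
    groups = {}
    for value in set(values):
        groups.setdefault(value.lower(), set()).add(value)
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}

# Made-up stand-ins for the Text and Pronoun columns (assumption, not real data).
toy_texts = ["first text", "second text", "first text"]
toy_pronouns = ["her", "His", "his", "his", "She", "she", "he"]

print(duplicated_values(toy_texts))
print(casing_variants(toy_pronouns))
```

If `casing_variants` returns groups for the real Pronoun column, lowercasing the column before modelling would likely collapse the 9 values into fewer.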


In [0]:
A_names = set(dataset["A"])
B_names = set(dataset["B"])

intersection = A_names.intersection(B_names)
len(intersection)

230

###  Figuring out what offset means

Theory 1: the offset is the number of words between the start of the text and the specified word (1 increment = 1 word).

In [0]:
any(list([False,2,3]))

True

In [0]:
def tokenize(txt):
  # Characters that are emitted as their own single-character tokens.
  separators = [',', '.', '?', '!', '"', ' ', ':', ';', '(', ')']
  tokens = []
  word = ""

  # Iterate over the characters directly, so texts containing
  # '#', '@' or '&' are tokenized in full.
  for char in txt:
    if char in separators:
      # A separator ends the current word (if any) and is kept as a token.
      if word != "":
        tokens.append(word)
        word = ""
      tokens.append(char)
    else:
      word += char
  if word != "":
    tokens.append(word)
  return tokens

# tokenize test
a = tokenize("Hello, World!")
a

['Hello', ',', ' ', 'World', '!']
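
As an alternative sketch, the same split can be written as a single regular expression: a token is either a run of non-separator characters or exactly one separator character. The name `tokenize_re` is hypothetical, and the separator set is copied from `tokenize()` above, so this is a restatement of that function rather than a different tokenization scheme.

```python
import re

# Either a run of characters that are not separators, or one separator.
TOKEN_PATTERN = re.compile(r'[^,.?!" :;()]+|[,.?!" :;()]')

def tokenize_re(text):
    """Split text into words and single-character separator tokens."""
    return TOKEN_PATTERN.findall(text)

print(tokenize_re("Hello, World!"))  # ['Hello', ',', ' ', 'World', '!']
```

Because every character ends up in exactly one token, joining the tokens back together reproduces the original text, which is what makes character counts over tokens meaningful.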

In [0]:
# tokenizing Text in the first row

test_row = dataset.iloc[0]
text = test_row["Text"]

test_row_tokens = tokenize(text)

print(len(test_row_tokens)) # number of tokens (words, spaces and punctuation)
print(text) # the entire text

155
Zoe Telford -- played the police officer girlfriend of Simon, Maggie. Dumped by Simon in the final episode of series 1, after he slept with Jenny, and is not seen again. Phoebe Thomas played Cheryl Cassidy, Pauline's friend and also a year 11 pupil in Simon's class. Dumped her boyfriend following Simon's advice after he wouldn't have sex with her but later realised this was due to him catching crabs off her friend Pauline.


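The text is only 73 words long (155 tokens once spaces and punctuation are included), yet A-offset for this row is 191. The offset can't be the number of words between the start of the text and the specified word. Maybe it counts characters?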

In [0]:
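def character_count(tokens):
  # Tokens listed here would be left out of the count. The list is empty,
  # so every token (spaces and punctuation included) is counted.
  punctuation = []
  length = 0
  for token in tokens:
    if token not in punctuation:
      length += len(token)
  return length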

character_count(test_row_tokens)

426

In [0]:
print(test_row_tokens)

['Zoe', ' ', 'Telford', ' ', '--', ' ', 'played', ' ', 'the', ' ', 'police', ' ', 'officer', ' ', 'girlfriend', ' ', 'of', ' ', 'Simon', ',', ' ', 'Maggie', '.', ' ', 'Dumped', ' ', 'by', ' ', 'Simon', ' ', 'in', ' ', 'the', ' ', 'final', ' ', 'episode', ' ', 'of', ' ', 'series', ' ', '1', ',', ' ', 'after', ' ', 'he', ' ', 'slept', ' ', 'with', ' ', 'Jenny', ',', ' ', 'and', ' ', 'is', ' ', 'not', ' ', 'seen', ' ', 'again', '.', ' ', 'Phoebe', ' ', 'Thomas', ' ', 'played', ' ', 'Cheryl', ' ', 'Cassidy', ',', ' ', "Pauline's", ' ', 'friend', ' ', 'and', ' ', 'also', ' ', 'a', ' ', 'year', ' ', '11', ' ', 'pupil', ' ', 'in', ' ', "Simon's", ' ', 'class', '.', ' ', 'Dumped', ' ', 'her', ' ', 'boyfriend', ' ', 'following', ' ', "Simon's", ' ', 'advice', ' ', 'after', ' ', 'he', ' ', "wouldn't", ' ', 'have', ' ', 'sex', ' ', 'with', ' ', 'her', ' ', 'but', ' ', 'later', ' ', 'realised', ' ', 'this', ' ', 'was', ' ', 'due', ' ', 'to', ' ', 'him', ' ', 'catching', ' ', 'crabs', ' ', 'off', '

In [0]:
def character_offset(tokens, end_token):
  # Collect every token that precedes the first occurrence of end_token.
  selected_tokens = []
  for token in tokens:
    if token == end_token:
      break
    selected_tokens.append(token)
  return character_count(selected_tokens)

test_end_token = tokenize(test_row['A'])[0] # "Cheryl"

In [0]:
test_row['A-offset'] == character_offset(test_row_tokens, test_end_token)

True

Aha! The offset is the number of characters (spaces included!) in column Text that come before the first character of the name in column A. Let's confirm with ALL the rows now.
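
If the offset really is a character index, the same thing can be checked without tokenizing at all: slicing Text at the offset should give back the mention. The sketch below does this for the first row, using a prefix of its text and the offsets shown in `dataset.head(5)`. The helper name `mention_at` is hypothetical.

```python
def mention_at(text, offset, mention):
    """Return True if `mention` starts exactly at character `offset` of `text`."""
    return text[offset:offset + len(mention)] == mention

# Prefix of the Text value of the first row (development-1).
row_one_prefix = (
    "Zoe Telford -- played the police officer girlfriend of Simon, Maggie. "
    "Dumped by Simon in the final episode of series 1, after he slept with "
    "Jenny, and is not seen again. Phoebe Thomas played Cheryl Cassidy, "
    "Pauline's friend and also a year 11 pupil in Simon's class. "
    "Dumped her boyfriend"
)

print(mention_at(row_one_prefix, 191, "Cheryl Cassidy"))  # A and A-offset
print(mention_at(row_one_prefix, 207, "Pauline"))         # B and B-offset
print(mention_at(row_one_prefix, 274, "her"))             # Pronoun and Pronoun-offset
```

On the full DataFrame the same check could be applied row by row, for example by looping over `dataset.iloc[i]` as in the cells that follow.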

In [0]:
dataset.shape[0]

2000

In [0]:
dataset_offsets = list(dataset["A-offset"])
my_offsets = []

my_booleans = []

for i in range(dataset.shape[0]):
  row = dataset.iloc[i]
  row_text = row["Text"]
  
  row_tokens = tokenize(row_text)
  
  row_end_token = tokenize(row["A"])[0]
  offset_with_A = character_offset(row_tokens, row_end_token)
  
  my_offsets.append(offset_with_A)
  my_booleans.append(row["A-offset"] == offset_with_A)


print(dataset_offsets)
print(my_offsets)

  
row  

[191, 228, 173, 174, 219, 236, 152, 173, 255, 196, 168, 217, 247, 275, 219, 192, 334, 181, 281, 330, 113, 317, 296, 273, 210, 92, 238, 204, 153, 241, 248, 229, 345, 238, 206, 340, 227, 246, 209, 161, 194, 312, 348, 361, 245, 335, 154, 259, 138, 51, 245, 247, 189, 104, 188, 104, 183, 159, 194, 168, 305, 312, 384, 176, 152, 129, 187, 304, 212, 181, 225, 29, 279, 302, 290, 139, 228, 251, 296, 341, 169, 285, 232, 220, 240, 405, 185, 231, 296, 313, 179, 511, 117, 174, 364, 165, 403, 240, 257, 254, 214, 0, 224, 249, 183, 215, 158, 196, 217, 172, 0, 0, 184, 264, 250, 290, 188, 285, 402, 350, 224, 406, 435, 320, 192, 324, 219, 185, 129, 216, 259, 242, 186, 196, 215, 354, 183, 307, 322, 248, 409, 342, 9, 268, 251, 501, 298, 225, 318, 24, 206, 158, 379, 224, 143, 316, 497, 347, 243, 137, 167, 220, 277, 256, 439, 320, 212, 189, 255, 263, 221, 227, 295, 126, 204, 243, 237, 192, 258, 181, 200, 207, 237, 166, 53, 21, 250, 328, 220, 282, 307, 282, 327, 203, 260, 430, 240, 355, 187, 103, 138, 194, 184

ID                                                 development-2000
Text              Watkins was a close friend of Hess' first wife...
Pronoun                                                         her
Pronoun-offset                                                  373
A                                                         Elizabeth
A-offset                                                        293
B                                                           Watkins
B-offset                                                        347
URL                      http://en.wikipedia.org/wiki/Linda_Watkins
Name: 1999, dtype: object

In [0]:
tokenize(dataset.iloc[2]["A"])

[]

In [0]:
test_row = dataset.iloc[0]
tokenize(test_row["A"])[0]

In [0]:
row_four = dataset.iloc[3]
row_four_text = row_four["Text"]
row_four_A = tokenize(row_four["A"])[0]
row_four_tokens = tokenize(row_four_text)
print(row_four_A)
print(row_four_tokens)

#fixed

Hell
['The', ' ', 'current', ' ', 'members', ' ', 'of', ' ', 'Crime', ' ', 'have', ' ', 'also', ' ', 'performed', ' ', 'in', ' ', 'San', ' ', 'Francisco', ' ', 'under', ' ', 'the', ' ', 'band', ' ', 'name', ' ', "''Remote", ' ', 'Viewers``', '.', ' ', 'Strike', ' ', 'has', ' ', 'published', ' ', 'two', ' ', 'works', ' ', 'of', ' ', 'fiction', ' ', 'in', ' ', 'recent', ' ', 'years:', ' ', 'Ports', ' ', 'of', ' ', 'Hell', ',', ' ', 'which', ' ', 'is', ' ', 'listed', ' ', 'in', ' ', 'the', ' ', 'Rock', ' ', 'and', ' ', 'Roll', ' ', 'Hall', ' ', 'of', ' ', 'Fame', ' ', 'Library', ',', ' ', 'and', ' ', 'A', ' ', 'Loud', ' ', 'Humming', ' ', 'Sound', ' ', 'Came', ' ', 'from', ' ', 'Above', '.', ' ', 'Rank', ' ', 'has', ' ', 'produced', ' ', 'numerous', ' ', 'films', ' ', '(under', ' ', 'his', ' ', 'real', ' ', 'name', ',', ' ', 'Henry', ' ', 'Rosenthal)', ' ', 'including', ' ', 'the', ' ', 'hit', ' ', 'The', ' ', 'Devil', ' ', 'and', ' ', 'Daniel', ' ', 'Johnston', '.']


In [0]:
row_six = dataset.iloc[5]
row_six_text = row_six["Text"]
row_six_A = tokenize(row_six["A"])[0]
row_six_tokens = tokenize(row_six_text)
print(row_six_A)
print(row_six_tokens)

Collins
['Sandra', ' ', 'Collins', ' ', 'is', ' ', 'an', ' ', 'American', ' ', 'DJ', '.', ' ', 'She', ' ', 'got', ' ', 'her', ' ', 'start', ' ', 'on', ' ', 'the', ' ', 'West', ' ', 'Coast', ' ', 'of', ' ', 'the', ' ', 'U', '.', 'S', '.', ' ', 'in', ' ', 'Phoenix', ',', ' ', 'Arizona', ' ', 'and', ' ', 'into', ' ', 'residencies', ' ', 'in', ' ', 'Los', ' ', 'Angeles', ',', ' ', 'and', ' ', 'eventually', ' ', 'moved', ' ', 'towards', ' ', 'trance', '.', ' ', 'She', ' ', 'used', ' ', 'American', ' ', 'producers', ' ', 'to', ' ', 'give', ' ', 'herself', ' ', 'a', ' ', 'unique', ' ', 'sound', '.', ' ', 'Collins', ' ', 'performed', ' ', 'for', ' ', 'an', ' ', 'estimated', ' ', '80', ',', '000', ' ', 'people', ' ', 'on', ' ', 'the', ' ', 'first', ' ', 'night', ' ', 'of', ' ', 'Woodstock', ' ', "'99", ',', ' ', 'and', ' ', 'was', ' ', 'the', ' ', 'first', ' ', 'female', ' ', 'DJ', ' ', 'featured', ' ', 'in', ' ', 'the', ' ', 'Tranceport', ' ', 'series', ' ', 'of', ' ', 'influential', ' ', 'rec

Conclusions: 
*  The name in column A can occur more than once in Text (as seen with row_six and "Collins"), so the first match is not necessarily the occurrence that A-offset points to.
*  The character offset counts every character, including spaces and all types of punctuation.
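
To illustrate the first conclusion, the sketch below lists every position at which a name occurs in a text. The text is a prefix of row six, rebuilt by joining the tokens printed above; the helper name `find_all` is hypothetical. The A-offset listed for that row in the output above is 236, which corresponds to the second occurrence rather than the first.

```python
def find_all(text, mention):
    """Return the start offsets of every occurrence of `mention` in `text`."""
    positions = []
    start = text.find(mention)
    while start != -1:
        positions.append(start)
        start = text.find(mention, start + 1)
    return positions

# Prefix of the Text value of row six, reconstructed from its tokens.
row_six_prefix = (
    "Sandra Collins is an American DJ. She got her start on the West Coast "
    "of the U.S. in Phoenix, Arizona and into residencies in Los Angeles, "
    "and eventually moved towards trance. She used American producers to "
    "give herself a unique sound. Collins performed for an estimated "
    "80,000 people"
)

print(find_all(row_six_prefix, "Collins"))  # [7, 236]
```

A check that stops at the first match, like `character_offset()`, would report 7 for this row and therefore disagree with the dataset.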

Notes:
* My tokenize() function keeps words and punctuation as separate tokens. Punctuation may prove important in determining whom the pronoun is truly referring to.

## Developing baseline models

By now, we know what the offsets are and why they might matter for deciding whom the pronoun refers to. That is enough to develop a baseline model. The lack of labels makes classification hard, though, and reading 2,000 texts to label each one by hand is a painstaking process that completely misses the point of this challenge: developing an unsupervised learning model.
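
One label-free baseline that uses only the offsets is a nearest-mention heuristic: guess whichever of A or B starts closer to the pronoun. This is an assumption about what a baseline could look like, not a method taken from this notebook, and the function name `nearest_mention_guess` is hypothetical. The numbers in the usage line are the offsets of the first row shown in `dataset.head(5)`.

```python
def nearest_mention_guess(pronoun_offset, a_offset, b_offset):
    """Guess 'A' or 'B' by picking the mention whose start is closer to the pronoun.

    Ties go to 'A'. This is a heuristic sketch, not a trained model.
    """
    distance_a = abs(pronoun_offset - a_offset)
    distance_b = abs(pronoun_offset - b_offset)
    return "A" if distance_a <= distance_b else "B"

# First row: Pronoun-offset 274, A-offset 191, B-offset 207.
print(nearest_mention_guess(274, 191, 207))  # B
```

Whether the guess is right cannot be told from the offsets alone, which is why a heuristic like this is only a starting point to compare later models against.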