#The mysteries behind word associations¶

Welcome to my first notebook. Here, our goal is to extract interesting features from the Wordgame dataset, 
a dataset containing 0.3M word-word associations scraped from Word Association Games running on 10 internet forums. 
This basic feature extraction could be useful to gain insight into the the properties of the dataset, 
which could be used to construct better classification models.

##Data exploration

Let's open up the dataset.

In [4]:
import pandas as pd
import numpy as np

df = pd.read_csv('../data/processed/wordgame_201706.csv')
df.head(15)

Unnamed: 0,author,word1,word2,source
0,4688,Crows,Feet,the_fishy
1,4841,Salute,Respect,the_fishy
2,1732,Pride,Arrogance,gog
3,1272,knife,butter,gog
4,418,Bed Head,My hair right now,atu2
5,383,emotion,Feeling,atu2
6,3213,drive,taxi,sas
7,4688,Faraway,Distant,the_fishy
8,483,Respect,RELEVANT,atu2
9,1804,suburb,downtown,gog


And print some statistics of the dataset.

In [5]:
print("Total number of word pairs: " + str(len(df)))
# and more..

Total number of word pairs: 334012


Indeed we have approximately ~0.3M word-pairs. Considering that each word appears twice in the dataset, we also have ~0.3M words contributed by different users at different forums. Due to filtering of pairs containing NaN (empty) values, this is not exactly true, but we simply ignore that for now. Given this enormous bag of words, what would be the most frequently occurring words? Would it be random words like 'fork' or would it represent the most important aspects of life?

In [6]:
# convert all words to lowercase, otherwise 'Fork' and 'fork' will be counted seperately
df['word1'] = df['word1'].map(str).apply(lambda x: x.lower())
df['word2'] = df['word2'].map(str).apply(lambda x: x.lower())
print(df['word2'].value_counts().head(7))

water    647
time     538
music    534
love     459
money    454
fire     446
food     439
Name: word2, dtype: int64


## No forks...
It appears that the most frequent words occuring in association games, are not random at all. Some of the words even are the most important aspects of life, water and food for example...   

(Maybe move to another notebook?)



In [16]:
nt = df[(df['source'] != "wrongplanet") | (df['source'] != "aspiecentral")]
#nt['nt'] = 1
asd  = df[(df['source'] == "wrongplanet") | (df['source'] == "aspiecentral")]
#asd['nt'] = 0
print(nt['word2'].value_counts().head(17))
print(asd['word2'].value_counts().head(17))

water    647
time     538
music    534
love     459
money    454
fire     446
food     439
game     407
house    405
dog      405
ball     393
man      391
red      387
life     373
green    373
death    372
light    357
Name: word2, dtype: int64
water        159
food         131
music        130
money        122
death        112
dog          107
fire         106
time         101
love          95
blood         92
cat           91
red           87
tree          85
chocolate     85
green         85
war           83
fish          83
Name: word2, dtype: int64
