## Data Exploration
We use a hypothesis test to identify 'good' and 'bad' words as follows. (Note that the data set contains about 80k insincere questions out of 1.3M.)
* We draw 100 samples ($df_1$, $df_2$, ..., $df_{100}$), each containing 80k questions chosen at random.


* For each word and for each of our samples we find the quotient  
$$w_i =\frac{\text{number of times the word is in }df_i}{\text{number of words in }df_i}$$


* We make a sample $dg$ containing only insincere questions (roughly 80k questions), and for each word we find
$$\hat{w} =\frac{\text{number of times the word is in }dg}{\text{number of words in }dg}$$


* At this point, for each `word` we have $W:=[w_1, w_2, w_3, ..., w_{100}]$, which roughly describes how likely the word is to appear in a random question, and $\hat{w}$, which describes how likely it is to appear in an insincere question.


* If `word` is a 'bad word', we expect it to be more common in the sample $dg$ than in the samples $df_i$. We therefore call `word` a `bad word` if $\hat{w} > \mathrm{percentile}_{95}(W)$. Similarly, we call `word` a `good word` if $\hat{w} < \mathrm{percentile}_{5}(W)$.


* At the end of this notebook we provide a data frame with `bad words` and another data frame with `good words` according to the previous definitions.  
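
The procedure described in the list above can be sketched in a few lines of pandas and NumPy. This is only an illustration under stated assumptions, not the code of the `get_good_and_bad_words` helper imported later: the function names `word_frequencies` and `classify_words`, the text column name `question_text`, the simple regex tokenizer, and the choice to consider only words that occur in $dg$ are all hypothetical. It also returns plain word lists and does not reproduce the numeric scores shown in the outputs further down.

```python
import re
from collections import Counter

import numpy as np
import pandas as pd


def word_frequencies(texts):
    """Return {word: count / total number of words} for an iterable of strings."""
    counts = Counter()
    for text in texts:
        # Assumed tokenizer: lowercase runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def classify_words(df, n_samples=100, seed=0):
    """Return (good_words, bad_words) using the 5th/95th percentile rule."""
    # dg: all insincere questions ('question_text' is an assumed column name).
    insincere = df.loc[df["target"] == 1, "question_text"]
    w_hat = word_frequencies(insincere)

    # W[word] holds w_1 ... w_n, one frequency per random sample df_i.
    # Simplification: only words that occur in dg are tracked.
    W = {word: np.zeros(n_samples) for word in w_hat}
    for i in range(n_samples):
        sample = df["question_text"].sample(n=len(insincere), random_state=seed + i)
        freqs = word_frequencies(sample)
        for word in W:
            W[word][i] = freqs.get(word, 0.0)

    good, bad = [], []
    for word, w in w_hat.items():
        low, high = np.percentile(W[word], [5, 95])
        if w > high:
            bad.append(word)   # unusually common among insincere questions
        elif w < low:
            good.append(word)  # unusually rare among insincere questions
    return good, bad
```

Each random sample has the same size as $dg$, which mirrors the 80k-question samples in the description. Because words that never appear in an insincere question are not tracked, this sketch can miss some 'good' words that a fuller implementation would find.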

In [1]:
import pandas as pd
import numpy as np
from get_good_and_bad_words import get_good_and_bad_words

In [2]:
data = pd.read_csv('train_inisincere.csv', index_col='qid')

In [3]:
p = data.target.value_counts().plot.bar(title='target')

In [4]:
good_words, bad_words = get_good_and_bad_words(data[:100000])

In [5]:
good_words.shape

(728,)

In [6]:
bad_words.shape

(2331,)

In [7]:
bad_words.sort_values(ascending=False)

muslims         62.773154
liberals        62.098782
indians         54.841423
blacks          43.280292
trump           41.182579
women           39.997990
men             39.001293
americans       38.642570
tamils          37.598218
jews            36.599325
gays            36.262885
whites          34.713537
hindus          34.361256
people          34.221182
muslim          32.616544
democrats       31.122052
ass             31.020095
christians      30.423896
white           28.345539
castration      28.221813
castrate        28.168553
europeans       28.043078
obama           27.503103
castrated       25.125905
feminists       24.500304
racist          24.300176
zionist         24.154615
gay             23.832707
fuck            23.785949
atheists        23.544169
                  ...    
movements        0.058386
puts             0.057851
exclusively      0.057307
touched          0.056808
justice          0.056430
born             0.054951
gangs            0.054843
kings       

In [8]:
good_words.sort_values(ascending=False)

best           16.504913
engineering     6.120025
good            6.077184
get             5.991973
someone         5.502169
online          5.392302
learn           5.208268
company         5.136617
job             4.929741
work            4.927568
difference      4.779182
life            4.733262
possible        4.668606
business        4.525309
science         4.434187
computer        4.362626
use             4.300982
study           4.298061
experience      4.221857
books           4.217862
career          4.162890
start           4.131409
college         4.057600
student         4.008607
data            3.908517
tips            3.860390
app             3.838421
way             3.788955
exam            3.745628
time            3.739640
                 ...    
amount          0.114707
getting         0.113983
details         0.111854
various         0.101674
happened        0.100446
channel         0.099802
differ          0.099508
driving         0.098794
topic           0.098295
