## Data Exploration
We run a hypothesis test to find 'good' and 'bad' words as follows. (Note that the data set contains around 80k insincere questions out of 1.3M.)
* We draw 100 samples ($df_1$, $df_2$, ..., $df_{100}$), each containing 80k questions chosen at random.


* For each word and each sample $df_i$, we compute the relative frequency  
$$w_i =\frac{\text{number of times the word occurs in }df_i}{\text{number of words in }df_i}$$


* We build a sample $dg$ containing only insincere questions (roughly 80k questions), and for each word we compute
$$\hat{w} =\frac{\text{number of times the word occurs in }dg}{\text{number of words in }dg}$$


* At this point, for each `word` we have $W:=[w_1, w_2, w_3, ..., w_{100}]$, which roughly describes how likely the word is to appear in a random question, and $\hat{w}$, which describes how likely the word is to appear in an insincere question.


* If `word` is a 'bad word', we expect it to be more common in the sample $dg$ than in the samples $df_i$. Hence we say that `word` is a `bad word` if $\hat{w} > \text{percentile}_{95}(W)$, the 95th percentile of $W$. Similarly, we say that `word` is a `good word` if $\hat{w} < \text{percentile}_{5}(W)$, the 5th percentile of $W$.


* At the end of this notebook we provide a data frame with `bad words` and another data frame with `good words` according to the previous definitions.  
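
The steps above can be sketched in a few lines of pandas. This is only an illustration of the procedure as described, not the code behind `get_good_and_bad_words`, which is imported from a separate module and may differ. The function names `word_frequencies` and `sketch_good_and_bad_words`, the text column name `question_text`, the whitespace tokenisation, and the choice of label `1` for insincere questions are all assumptions made for this example.

```python
from collections import Counter

import pandas as pd


def word_frequencies(questions):
    """Relative frequency of each word in an iterable of question strings."""
    counts = Counter()
    for text in questions:
        # Assumption: lowercase + whitespace split stands in for real tokenisation.
        counts.update(str(text).lower().split())
    total = sum(counts.values())
    return pd.Series(counts, dtype=float) / total


def sketch_good_and_bad_words(df, text_col="question_text", target_col="target",
                              n_samples=100, seed=0):
    """Hypothetical sketch of the percentile test described above."""
    # dg: insincere questions only (assumed to be labelled 1), giving w_hat.
    insincere = df.loc[df[target_col] == 1, text_col]
    w_hat = word_frequencies(insincere)

    # df_1 ... df_n: random samples of the same size as dg, giving w_i.
    size = len(insincere)
    samples = [
        word_frequencies(df[text_col].sample(n=size, random_state=seed + i))
        for i in range(n_samples)
    ]
    W = pd.DataFrame(samples)  # rows: samples, columns: words

    # A word missing from a sample (or from dg) has frequency 0 there.
    all_words = W.columns.union(w_hat.index)
    W = W.reindex(columns=all_words).fillna(0.0)
    w_hat = w_hat.reindex(all_words).fillna(0.0)

    bad = w_hat[w_hat > W.quantile(0.95)]   # above the 95th percentile of W
    good = w_hat[w_hat < W.quantile(0.05)]  # below the 5th percentile of W
    return good, bad
```

Note that this sketch returns the raw $\hat{w}$ values for the selected words, so its numbers are not expected to match the scores printed in the cells below.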

In [2]:
import pandas as pd
import numpy as np
from get_good_and_bad_words import get_good_and_bad_words

In [2]:
data = pd.read_csv('train_inisincere.csv', index_col='qid')

In [3]:
p = data.target.value_counts().plot.bar(title='target')

In [11]:
good_words, bad_words = get_good_and_bad_words(data[:100000])

In [12]:
good_words.shape

(735,)

In [13]:
bad_words.shape

(2329,)

In [14]:
bad_words.sort_values(ascending=False)

muslims        62.725718
liberals       50.063252
indians        46.419911
blacks         44.518679
gays           44.022740
trump          43.851490
women          41.710316
tamils         39.257291
ass            38.694949
men            38.010088
hindus         36.839726
americans      35.862371
white          33.306424
democrats      30.213616
muslim         30.031719
people         30.010778
jews           28.262370
castration     28.199985
gay            27.804152
christians     27.512319
whites         26.380525
obama          25.999169
europeans      25.145652
castrated      24.791974
asians         24.780436
fuck           23.300383
racist         23.228772
hate           22.912011
girls          22.649177
black          22.033753
                 ...    
values          0.066777
abraham         0.064749
holy            0.064681
fascism         0.061846
celebrating     0.061734
respected       0.061655
executed        0.061628
pushing         0.059191
nowhere         0.057886


In [15]:
good_words.sort_values(ascending=False)

best            15.714790
good             7.362767
get              6.544825
someone          5.522103
difference       5.377975
work             5.189375
engineering      5.142801
use              5.094091
college          4.922048
job              4.896898
student          4.830560
study            4.683208
life             4.657521
online           4.636310
possible         4.574371
learn            4.383128
way              4.369264
computer         4.225865
university       4.058048
company          3.992585
business         3.925053
app              3.858476
phone            3.840721
one              3.791584
books            3.776788
data             3.754845
water            3.656811
exam             3.623839
book             3.621622
find             3.592053
                  ...    
risk             0.093545
getting          0.091541
idea             0.090839
regular          0.089020
open             0.085547
rules            0.080895
channel          0.080344
differ      