# _Big Discoveries with Big Data_

#### From outing years of rigging behind some of the biggest games in sports, to shedding light on the slumlords of the nation’s major cities, to capturing and counting sunspots, data science serves one of the greatest roles in modern-day mystery solving and crime fighting. Data science discoveries have paved the way for incredible and sometimes shocking revelations in all sorts of topics ranging from climate science to the drug market. Learn how to manipulate massive amounts of data in order to find patterns in the chaos and help make the world a better place, data set by data set!


I took [this](http://nbviewer.jupyter.org/gist/nealcaren/5105037) and used it on a custom corpus but due to the discrepancies between Python versions, I need to update it for Python3.

### Questions to Consider:

- What are some ways we can visualize data?

Segments of code, examples, visuals, tests they can implement in python notebook:

Answer for question, analysis/summary of the data they just played with:

### What Have We Learned? And What's Next?

In [89]:
from __future__ import division

import glob
import nltk
nltk.download('punkt')

from string import punctuation

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Ivy\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [90]:
#Two lists  of words that are used when a man or woman is present, based on Danielle Sucher's https://github.com/DanielleSucher/Jailbreak-the-Patriarchy
male_words=set(['Trump','guy','spokesman','chairman',"men's",'men','him',"he's",'his','boy','boyfriend','boyfriends','boys','brother','brothers','dad','dads','dude','father','fathers','fiance','gentleman','gentlemen','god','grandfather','grandpa','grandson','groom','he','himself','husband','husbands','king','male','man','mr','nephew','nephews','priest','prince','son','sons','uncle','uncles','waiter','widower','widowers'])
female_words=set(['heroine','spokeswoman','chairwoman',"women's",'actress','women',"she's",'her','aunt','aunts','bride','daughter','daughters','female','fiancee','girl','girlfriend','girlfriends','girls','goddess','granddaughter','grandma','grandmother','herself','ladies','lady','lady','mom','moms','mother','mothers','mrs','ms','niece','nieces','priestess','princess','queens','she','sister','sisters','waitress','widow','widows','wife','wives','woman'])


In [91]:
def gender_the_sentence(sentence_words):
    mw_length=len(male_words.intersection(sentence_words))
    fw_length=len(female_words.intersection(sentence_words))

    if mw_length>0 and fw_length==0:
        gender='male'
    elif mw_length==0 and fw_length>0: 
        gender='female'
    elif mw_length>0 and fw_length>0: 
        gender='both'
    else:
        gender='none'
    return gender

In [92]:
def is_it_proper(word):
        if word[0]==word[0].upper():
            case='upper'
        else:
            case='lower'
        
        word_lower=word.lower()
        try:
            proper_nouns[word_lower][case] = proper_nouns[word_lower].get(case,0)+1
        except Exception as e:
            #This is triggered when the word hasn't been seen yet
            proper_nouns[word_lower]= {case:1}

In [93]:
def increment_gender(sentence_words,gender):
    sentence_counter[gender]+=1
    word_counter[gender]+=len(sentence_words)
    for word in sentence_words:
        word_freq[gender][word]=word_freq[gender].get(word,0)+1

In [94]:
sexes=['male','female','none','both']
sentence_counter={sex:0 for sex in sexes}
word_counter={sex:0 for sex in sexes}
word_freq={sex:{} for sex in sexes}
proper_nouns={}

In [95]:
file_list=glob.glob('articles/*.txt')

In [96]:
for file_name in file_list:
    #Open the file
    text = open(file_name,'rb').read()
    
    #Split into sentences
    try:
        sentences = tokenizer.tokenize(text)
    except:
        pass
    
    
    for sentence in sentences:
        #word tokenize and strip punctuation
            sentence_words=sentence.split()
            sentence_words=[w.strip(punctuation) for w in sentence_words 
                            if len(w.strip(punctuation))>0]
            
            #figure out how often each word is capitalized
            [is_it_proper(word) for word in sentence_words[1:]]

            #lower case it
            sentence_words=set([w.lower() for w in sentence_words])
            
            #Figure out if there are gendered words in the sentence by computing the length of the intersection of the sets
            gender=gender_the_sentence(sentence_words)

            #Increment some counters
            increment_gender(sentence_words,gender)

In [97]:
proper_nouns=set([word for word in proper_nouns if  
                  proper_nouns[word].get('upper',0) / 
                  (proper_nouns[word].get('upper',0) + 
                   proper_nouns[word].get('lower',0))>.50])

In [98]:
common_words=set([w for w in sorted (word_freq['female'],
                                     key=word_freq['female'].get,reverse=True)[:1000]]+[w for w in sorted (word_freq['male'],key=word_freq['male'].get,reverse=True)[:1000]])
print(word_freq['male'])

common_words=list(common_words-male_words-female_words-proper_nouns)

{'period': 5, 'very': 20, '20': 5, 'said': 25, 'decided': 10, 'go': 10, 'caterpillar': 5, 'bothers': 5, 'told': 10, 'fox': 10, 'story': 5, 'how': 10, 'number': 5, 'so': 15, 'if': 20, 'should': 5, 'hope': 10, 'trump': 30, 'congratulate': 5, 'about': 40, 'watched': 5, 'win': 30, 'congress': 10, 'versus': 5, 'today': 10, 'going': 25, 'tractors': 5, 'enough': 5, 'fact': 5, 'were': 5, 'which': 5, 'probably': 5, 'went': 5, 'this': 20, 'by': 5, 'rubio': 5, 'bring': 5, 'uniting': 5, 'excavation': 5, 'look': 10, 'him': 45, 'hasn’t': 25, 'all': 10, 'nasty': 5, 'it’s': 10, 'often': 5, 'be': 10, 'earlier': 5, 'question': 30, 'far': 5, 'campaign': 10, 'who': 15, 'mr': 25, 'no': 10, 'man': 5, 'orders': 5, 'lightweight': 5, 'your': 10, 'fall': 5, 'came': 5, 'room': 5, 'cannot': 5, 'controlled': 5, 'make': 5, 'does': 15, 'endorsement': 5, 'movement': 5, 'bad': 5, 'percent': 5, 'japan': 5, 'washington': 5, 'did': 10, 'with': 35, 'doesn’t': 5, 'because': 5, 'ultimate': 5, 'wins': 5, 'anything': 25, 'cut

In [99]:
male_percent={word:(word_freq['male'].get(word,0) / word_counter['male']) 
              / (word_freq['female'].get(word,0) / word_counter['female']+word_freq['male'].get(word,0)/word_counter['male']) for word in common_words}

In [100]:
header ='Ratio\tMale\tFemale\tWord'
print('Male words')
print(header)
for word in sorted (male_percent,key=male_percent.get,reverse=True)[:50]:
    try:
        ratio=male_percent[word]/(1-male_percent[word])
    except:
        ratio=100
    print('%.1f\t%02d\t%02d\t%s' % (ratio,word_freq['male'].get(word,0),word_freq['female'].get(word,0),word))

print('\n'*2)
print('Female words')
print(header)
for word in sorted (male_percent,key=male_percent.get,reverse=False)[:50]:
    try:
        ratio=(1-male_percent[word])/male_percent[word]
    except:
        ratio=100
    print('%.1f\t%01d\t%01d\t%s' % (ratio,word_freq['male'].get(word,0),word_freq['female'].get(word,0),word))

Male words
Ratio	Male	Female	Word
100.0	05	00	period
100.0	10	00	decided
100.0	05	00	bothers
100.0	05	00	businessman
100.0	05	00	vincente
100.0	10	00	how
100.0	05	00	cannot
100.0	05	00	clear
100.0	05	00	they’ve
100.0	10	00	hope
100.0	05	00	concept
100.0	05	00	speaker
100.0	10	00	word
100.0	30	00	win
100.0	05	00	versus
100.0	05	00	business
100.0	10	00	today
100.0	05	00	tractors
100.0	05	00	fact
100.0	05	00	were
100.0	05	00	room
100.0	05	00	total
100.0	05	00	excavation
100.0	05	00	nominee
100.0	05	00	nasty
100.0	05	00	often
100.0	05	00	earlier
100.0	15	00	who
100.0	05	00	arrogance
100.0	10	00	campaign
100.0	05	00	become
100.0	10	00	no
100.0	05	00	orders
100.0	05	00	hate
100.0	05	00	lightweight
100.0	10	00	your
100.0	05	00	movement.”
100.0	05	00	fall
100.0	05	00	come
100.0	05	00	care
100.0	15	00	does
100.0	05	00	coming
100.0	05	00	movement
100.0	15	00	mine
100.0	05	00	percent
100.0	10	00	hard
100.0	05	00	wins
100.0	05	00	cut
100.0	10	00	night
100.0	05	00	congressmen



Female words
Ratio	