# CS579: Lecture 12  

**Demographic Inference I**

*[Dr. Aron Culotta](http://cs.iit.edu/~culotta)*  
*[Illinois Institute of Technology](http://iit.edu)*

# Midterm review 

~5 True/False, ~6 short answer
 
**Topics:** 
- Twitter API 
 - what comes in a tweet? 
 - how do rate limits work? 
 - can you understand API documentation? 
- graph basics: 
 - directed/undirected 
 - path 
 - cycle 
 - connected 
 - connected component 
 - degree (distribution) 
 - diameter 
 - average path length 
 - clustering coefficient 
- modeling networks 
 - random graphs 
 - regular graphs 
 - rewired graphs 
 - what makes a small world? 
- community detection 
 - girvan-newman (betweenness) 
 - graph cuts 
 - representing graphs with matrices 
 - graph laplacian 
- link prediction 
 - shortest path 
 - common neighbors 
 - jaccard 
 - preferential attachment 
 - sim rank 
 - evaluation 
- information cascades 
 - urn experiment 
 - bayes' theorem for decision making 
 - game-theoretic model 
 - maximizing payoff 
 - cluster density 
- sentiment analysis 
 - lexicon approach 
 - machine learning 
 
**Question types:** 
- What does this algorithm output? 
 - E.g., what is jaccard score for a specific link? 
 - E.g., what is the next step in girvan-newman? 
- What does this code do? 
 - E.g., I give you a new graph-generating algorithm, tell me what it produces 
- Write a new algorithm 
 - E.g., provide pseudo-code for the linear-threshold cascade model 
- True/False 
 - E.g., small world graphs have higher clustering coefficients than random graphs.

**dem·o·graph·ics**

statistical data relating to the population and particular groups within it.

E.g., age, ethnicity, gender, income, ...

# Why Demographics?

- Marketing
  - Who are my customers?
  - Who are my competitors' customers?
  - E.g., [DemographicsPro](http://www.demographicspro.com/samples#c=%40FamilyGuyonFOX)
  
- Social Media as Surveys
  - E.g., 45% of tweets express positive sentiment toward Pres. Obama
  - Who wrote those tweets?
  
- Health
  - 2% of Facebook users are expressing flu-like symptoms
  - Are they representative of the full population?



**User profiles vary from site to site.**

![rahm](rahm.png)

![rahm-fb](rahm-fb.png)

![rahm-li](rahm-li.png)

# Approaches

- Clever use of external data
  - E.g., U.S. Census name lists for gender
- Look for keywords in profile
  - "African American Male"
  - "Happy 21st birthday to me"
- Machine Learning
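
The keyword approach can be sketched with a regular expression. This is a minimal illustration rather than a robust extractor: the pattern, the helper name `guess_age_from_text`, and the sample strings are assumptions made for this example.

```python
import re

# Hypothetical pattern: matches phrases such as "Happy 21st birthday to me".
BIRTHDAY_RE = re.compile(
    r'happy (\d{1,2})(?:st|nd|rd|th) birthday to me', re.IGNORECASE)

def guess_age_from_text(text):
    """Return the age stated in a self-referential birthday message, or None."""
    match = BIRTHDAY_RE.search(text or '')
    return int(match.group(1)) if match else None

print(guess_age_from_text('Happy 21st birthday to me'))  # 21
print(guess_age_from_text('Happy birthday, Mom!'))       # None
```

A rule like this is precise but covers very few users, which is one motivation for the machine learning approach.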

In [1]:
# Guessing gender
import configparser
import sys
from TwitterAPI import TwitterAPI

def get_twitter(config_file):
    """ Read the config_file and construct an instance of TwitterAPI.
    Args:
      config_file ... A config file in ConfigParser format with Twitter credentials
    Returns:
      An instance of TwitterAPI.
    """
    # Credentials are read from the config file; never hard-code keys in a shared notebook.
    config = configparser.ConfigParser()
    config.read(config_file)
    twitter = TwitterAPI(
                   config.get('twitter', 'consumer_key'),
                   config.get('twitter', 'consumer_secret'),
                   config.get('twitter', 'access_token'),
                   config.get('twitter', 'access_token_secret'))
    return twitter

twitter = get_twitter('twitter.cfg')
tweets = []
n_tweets=1000
for r in twitter.request('statuses/filter', {'track': 'i'}):
    tweets.append(r)
    if len(tweets) % 100 == 0:
        print('%d tweets' % len(tweets))
    if len(tweets) >= n_tweets:
        break
print('fetched %d tweets' % len(tweets))

100 tweets
200 tweets
300 tweets
400 tweets
500 tweets
600 tweets
700 tweets
800 tweets
900 tweets
1000 tweets
fetched 1000 tweets


In [2]:
# not all tweets are returned
# https://dev.twitter.com/streaming/overview/messages-types#limit_notices
[t for t in tweets if 'user' not in t][:6]

[{'limit': {'timestamp_ms': '1479935162809', 'track': 56}},
 {'limit': {'timestamp_ms': '1479935162811', 'track': 42}},
 {'limit': {'timestamp_ms': '1479935162820', 'track': 31}},
 {'limit': {'timestamp_ms': '1479935162908', 'track': 52}},
 {'limit': {'timestamp_ms': '1479935163842', 'track': 155}},
 {'limit': {'timestamp_ms': '1479935163845', 'track': 130}}]

In [3]:
# restrict to actual tweets
# (drop limit notices and any other message without a 'user' field)
tweets = [t for t in tweets if 'user' in t]
print('fetched %d tweets' % len(tweets))

fetched 926 tweets
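
The filtering step can be made more explicit by sorting stream messages by kind. The sketch below runs on hand-made messages that mimic only the fields shown above; `split_stream_messages` is a hypothetical helper, not part of TwitterAPI.

```python
def split_stream_messages(messages):
    """Separate tweets (which carry a 'user' field) from limit notices and the rest."""
    tweet_msgs, limit_msgs, other_msgs = [], [], []
    for m in messages:
        if 'user' in m:
            tweet_msgs.append(m)
        elif 'limit' in m:
            limit_msgs.append(m)
        else:
            other_msgs.append(m)
    return tweet_msgs, limit_msgs, other_msgs

# Hand-made sample messages for illustration only.
sample = [
    {'user': {'name': 'chloe'}, 'text': 'i love this'},
    {'limit': {'timestamp_ms': '1479935162809', 'track': 56}},
    {'delete': {'status': {'id': 1}}},
]
tweets_only, limit_notices, other = split_stream_messages(sample)
print(len(tweets_only), len(limit_notices), len(other))  # 1 1 1
```

Counting the dropped messages by kind shows how much of the stream was filtered out before any analysis.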


In [4]:
tweets[0]

{'contributors': None,
 'coordinates': None,
 'created_at': 'Wed Nov 23 21:06:01 +0000 2016',
 'entities': {'hashtags': [{'indices': [66, 75], 'text': 'TeenWolf'}],
  'media': [{'display_url': 'pic.twitter.com/dRGMEbkOXL',
    'expanded_url': 'https://twitter.com/MTVteenwolf/status/801258378257465344/photo/1',
    'id': 801258348733730820,
    'id_str': '801258348733730820',
    'indices': [76, 99],
    'media_url': 'http://pbs.twimg.com/tweet_video_thumb/Cx6kORGXEAQgp8F.jpg',
    'media_url_https': 'https://pbs.twimg.com/tweet_video_thumb/Cx6kORGXEAQgp8F.jpg',
    'sizes': {'large': {'h': 280, 'resize': 'fit', 'w': 500},
     'medium': {'h': 280, 'resize': 'fit', 'w': 500},
     'small': {'h': 190, 'resize': 'fit', 'w': 340},
     'thumb': {'h': 150, 'resize': 'crop', 'w': 150}},
    'source_status_id': 801258378257465344,
    'source_status_id_str': '801258378257465344',
    'source_user_id': 252709160,
    'source_user_id_str': '252709160',
    'type': 'photo',
    'url': 'https://t

In [5]:
# Print 10 names.
names = [t['user']['name'] for t in tweets]
names[:10]

['Jordyn schrock☀️',
 'BottomBoi',
 'Vasily',
 'O-DOG',
 'alli♔',
 'M.',
 'chloe',
 "Daisy 'ㅅ' ♡",
 'Milena',
 'DorianDawes']

In [6]:
# Fetch census name data from:
# http://www2.census.gov/topics/genealogy/1990surnames/
import requests
from pprint import pprint
males_url = 'http://www2.census.gov/topics/genealogy/' + \
            '1990surnames/dist.male.first'
females_url = 'http://www2.census.gov/topics/genealogy/' + \
              '1990surnames/dist.female.first'
males = requests.get(males_url).text.split('\n')
females = requests.get(females_url).text.split('\n')
print('males:')
pprint(males[:10])
print('females:')
pprint(females[:10])

males:
['JAMES          3.318  3.318      1',
 'JOHN           3.271  6.589      2',
 'ROBERT         3.143  9.732      3',
 'MICHAEL        2.629 12.361      4',
 'WILLIAM        2.451 14.812      5',
 'DAVID          2.363 17.176      6',
 'RICHARD        1.703 18.878      7',
 'CHARLES        1.523 20.401      8',
 'JOSEPH         1.404 21.805      9',
 'THOMAS         1.380 23.185     10']
females:
['MARY           2.629  2.629      1',
 'PATRICIA       1.073  3.702      2',
 'LINDA          1.035  4.736      3',
 'BARBARA        0.980  5.716      4',
 'ELIZABETH      0.937  6.653      5',
 'JENNIFER       0.932  7.586      6',
 'MARIA          0.828  8.414      7',
 'SUSAN          0.794  9.209      8',
 'MARGARET       0.768  9.976      9',
 'DOROTHY        0.727 10.703     10']
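
Judging from the output above, each line holds four whitespace-separated fields: the name, its percent frequency, the cumulative percent, and the rank. A minimal sketch of parsing one line follows; the field names in `CensusName` are labels chosen for this example, not names defined by the Census file.

```python
from collections import namedtuple

# Field names are assumptions chosen for readability.
CensusName = namedtuple(
    'CensusName', ['name', 'percent', 'cumulative_percent', 'rank'])

def parse_census_line(line):
    """Split one line of the name file into typed fields."""
    name, percent, cumulative, rank = line.split()
    return CensusName(name.lower(), float(percent), float(cumulative), int(rank))

row = parse_census_line('JAMES          3.318  3.318      1')
print(row.name, row.percent, row.rank)  # james 3.318 1
```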


In [7]:
# Get names. 
male_names = set([m.split()[0].lower() for m in males if m])
female_names = set([f.split()[0].lower() for f in females if f])
print('%d male and %d female names' % (len(male_names), len(female_names)))
print('males:\n' + '\n'.join(list(male_names)[:10]))
print('\nfemales:\n' + '\n'.join(list(female_names)[:10]))

1219 male and 4275 female names
males:
michal
milan
dick
modesto
pedro
damien
gail
ethan
erich
tanner

females:
robbi
freda
helen
danyel
eleanora
jeanene
sally
zola
senaida
jackie


In [8]:
# Initialize gender of all tweets to unknown.
for t in tweets:
    t['gender'] = 'unknown'

In [9]:
# label a Twitter user's gender by matching name list.
import re
def gender_by_name(tweets, male_names, female_names):
    for t in tweets:
        name = t['user']['name']
        if name:
            # keep the first whitespace-separated token and strip punctuation.
            name_parts = re.findall(r'\w+', name.split()[0].lower())
            if len(name_parts) > 0:
                first = name_parts[0]
                if first in male_names:
                    t['gender'] = 'male'
                elif first in female_names:
                    t['gender'] = 'female'
                else:
                    t['gender'] = 'unknown'

gender_by_name(tweets, male_names, female_names)
# What's wrong with this approach?

In [10]:
from collections import Counter

def print_genders(tweets):
    counts = Counter([t['gender'] for t in tweets])
    print('%.2f of accounts are labeled with gender' % 
          ((counts['male'] + counts['female']) / sum(counts.values())))
    print('gender counts:\n', counts)
    for t in tweets[:20]:
        print(t['gender'], t['user']['name'])
    
print_genders(tweets)

0.32 of accounts are labeled with gender
gender counts:
 Counter({'unknown': 627, 'female': 176, 'male': 123})
unknown Jordyn schrock☀️
unknown BottomBoi
unknown Vasily
unknown O-DOG
unknown alli♔
unknown M.
female chloe
female Daisy 'ㅅ' ♡
unknown Milena
unknown DorianDawes
female bernice
male Ryan Sudol
unknown BG!
unknown ⁶⁹
unknown Bader Ahmed A.Wahed
female Jo'ella DeVille
male Logan Brown
female Diane Brodalski
female Jenna
unknown Gossip Girl Quotes


In [11]:
# What about ambiguous names?
def print_ambiguous_names(male_names, female_names):
    ambiguous = [n for n in male_names if n in female_names]  # names on both lists
    print('found %d ambiguous names:\n'% len(ambiguous))
    print('\n'.join(ambiguous[:20]))
    
print_ambiguous_names(male_names, female_names)

found 331 ambiguous names:

michal
gail
samuel
jackie
jewell
robert
glenn
ryan
louis
valentine
laurence
carrol
lane
victor
michel
frankie
larry
darnell
augustine
frances


In [12]:
# Keep names that are more frequent in one gender than the other.
def get_percents(name_list):
    # parse raw data to extract, e.g., the percent of males named John.
    return dict([(n.split()[0].lower(), float(n.split()[1]))
                  for n in name_list if n])

males_pct = get_percents(males)
females_pct = get_percents(females)

# Assign a name as male if it is more common among males than females.
male_names = set([m for m in male_names if m not in female_names or
              males_pct[m] > females_pct[m]])
female_names = set([f for f in female_names if f not in male_names or
              females_pct[f] > males_pct[f]])

print_ambiguous_names(male_names, female_names)
print('%d male and %d female names' % (len(male_names), len(female_names)))

found 0 ambiguous names:


1146 male and 4017 female names
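
The disambiguation rule can be checked on a toy example. The sketch below uses made-up percentages and a hypothetical helper, `split_by_majority`. Unlike the cell above, it computes both sets from the original inputs, so the result does not depend on which set is filtered first. Dropping ties from both lists is a choice made for this sketch.

```python
def split_by_majority(males_pct, females_pct):
    """Assign each name to the gender in which it is relatively more frequent.

    Names with equal frequency in both lists are dropped (an assumption of this sketch).
    """
    male = {n for n, p in males_pct.items() if p > females_pct.get(n, 0.0)}
    female = {n for n, p in females_pct.items() if p > males_pct.get(n, 0.0)}
    return male, female

# Made-up percentages for illustration only.
toy_males = {'ryan': 0.30, 'jackie': 0.02, 'john': 3.27}
toy_females = {'ryan': 0.01, 'jackie': 0.10, 'mary': 2.63}
toy_male_names, toy_female_names = split_by_majority(toy_males, toy_females)
print(sorted(toy_male_names))    # ['john', 'ryan']
print(sorted(toy_female_names))  # ['jackie', 'mary']
```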


In [13]:
# Relabel twitter users (compare with above)
gender_by_name(tweets, male_names, female_names)
print_genders(tweets)

0.32 of accounts are labeled with gender
gender counts:
 Counter({'unknown': 627, 'female': 193, 'male': 106})
unknown Jordyn schrock☀️
unknown BottomBoi
unknown Vasily
unknown O-DOG
unknown alli♔
unknown M.
female chloe
female Daisy 'ㅅ' ♡
unknown Milena
unknown DorianDawes
female bernice
male Ryan Sudol
unknown BG!
unknown ⁶⁹
unknown Bader Ahmed A.Wahed
female Jo'ella DeVille
male Logan Brown
female Diane Brodalski
female Jenna
unknown Gossip Girl Quotes


In [14]:
# Who are the unknowns?
# "Filtered" data can have big impact on analysis.
unknown_names = Counter(t['user']['name']
                        for t in tweets if t['gender'] == 'unknown')
unknown_names.most_common(20)

[('.', 4),
 ('ㅤ', 4),
 ('İŞ İLANLARI', 3),
 ('絶対リフォロー１００％です！！', 3),
 ('ㅤㅤㅤ', 2),
 ('Şafak Malatya 🇹🇷', 2),
 ('•', 2),
 ('LCoghlan', 2),
 ('deniz gök', 2),
 ('Mrs Nichols', 1),
 ('guess who fucking', 1),
 ('peristeraRa', 1),
 ('k. 💕', 1),
 ('Finesse kid', 1),
 ('María Mella', 1),
 ('screen.', 1),
 ('WhoopDiGobbleDoo', 1),
 ('Güllüoğlu Bengül', 1),
 ('11/20 4 desi 💋', 1),
 ('jsmoy', 1)]

In [15]:
# How do the profiles of male Twitter users differ from
# those of female users?

male_profiles = [t['user']['description'] for t in tweets
                if t['gender'] == 'male']

female_profiles = [t['user']['description'] for t in tweets
                if t['gender'] == 'female']
#male_profiles = [t['text'] for t in tweets
#                if t['gender'] == 'male']

#female_profiles = [t['text'] for t in tweets
#                if t['gender'] == 'female']

import re
def tokenize(s):
    return re.sub(r'\W+', ' ', s).lower().split() if s else []

male_words = Counter()
female_words = Counter()

for p in male_profiles:
    male_words.update(Counter(tokenize(p)))
                      
for p in female_profiles:
    female_words.update(Counter(tokenize(p)))

print('Most Common Male Terms:')
pprint(male_words.most_common(10))
    
print('\nMost Common Female Terms:')
pprint(female_words.most_common(10))

Most Common Male Terms:
[('i', 32),
 ('and', 28),
 ('the', 27),
 ('a', 24),
 ('is', 19),
 ('for', 14),
 ('of', 14),
 ('t', 13),
 ('to', 13),
 ('my', 11)]

Most Common Female Terms:
[('i', 43),
 ('a', 42),
 ('and', 28),
 ('the', 25),
 ('to', 22),
 ('my', 19),
 ('you', 16),
 ('is', 16),
 ('for', 12),
 ('m', 12)]


In [16]:
print(len(male_words))
print(len(female_words))

772
974


In [17]:
# Compute difference
diff_counts = dict([(w, female_words[w] - male_words[w])
                    for w in
                    set(female_words.keys()) | set(male_words.keys())])

sorted_diffs = sorted(diff_counts.items(), key=lambda x: x[1])

print('Top Male Terms (diff):')
pprint(sorted_diffs[:10])

print('\nTop Female Terms (diff):')
pprint(sorted_diffs[-10:])

Top Male Terms (diff):
[('of', -5),
 ('player', -5),
 ('de', -4),
 ('them', -4),
 ('professional', -3),
 ('entertainment', -3),
 ('19', -3),
 ('all', -3),
 ('x', -3),
 ('is', -3)]

Top Female Terms (diff):
[('ig', 7),
 ('lover', 7),
 ('sc', 8),
 ('no', 8),
 ('16', 8),
 ('my', 8),
 ('to', 9),
 ('snapchat', 10),
 ('i', 11),
 ('a', 18)]


**A problem with difference of counts:**

<br><br><br><br>
What if we have more male than female words in total?

<br><br><br><br>
Instead, consider "the probability that a male user writes the word **w**"

<br><br><br><br>

$$p(w|\hbox{male}) = \frac{\hbox{freq}(w, \hbox{male})}
{\sum_i \hbox{freq}(w_i, \hbox{male})} $$
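
A toy example shows why raw count differences mislead when one class simply has more words. The counts below are invented: the female sample is four times larger, but both classes use the words at the same rate.

```python
from collections import Counter

# Invented counts for illustration only.
toy_male_counts = Counter({'the': 10, 'game': 5})
toy_female_counts = Counter({'the': 40, 'game': 20})

def to_probs(counts):
    """Normalize counts so that they sum to 1 within a class."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Difference of counts suggests both words are "female" terms.
toy_diff = {w: toy_female_counts[w] - toy_male_counts[w] for w in toy_male_counts}
print(toy_diff)  # {'the': 30, 'game': 15}

# Normalized probabilities show that the usage rate is the same.
toy_male_p = to_probs(toy_male_counts)
toy_female_p = to_probs(toy_female_counts)
print(toy_male_p['game'], toy_female_p['game'])
```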

**Odds Ratio (OR)**

The ratio of the probabilities for a word from each class:

$$ OR(w) = \frac{p(w|\hbox{female})}{p(w|\hbox{male})} $$


- High values --> more likely to be written by females
- Low values --> more likely to be written by males
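
As a quick check of how to read the ratio, the sketch below computes it on invented probabilities. It is restricted to words seen in both classes, since the ratio is undefined when a word has no probability in one class. `toy_odds_ratios` and the numbers are assumptions made for this example.

```python
def toy_odds_ratios(male_probs, female_probs):
    """OR(w) = p(w|female) / p(w|male), for words seen in both classes."""
    shared = set(male_probs) & set(female_probs)
    return {w: female_probs[w] / male_probs[w] for w in shared}

# Invented probabilities for illustration only.
toy_male = {'player': 0.04, 'the': 0.10, 'rock': 0.01}
toy_female = {'player': 0.01, 'the': 0.10, 'snapchat': 0.03}

toy_ratios = toy_odds_ratios(toy_male, toy_female)
for word, value in sorted(toy_ratios.items(), key=lambda x: -x[1]):
    print(word, value)
```

A value near 1 (here, `the`) means the word is used at the same rate by both classes. A value below 1 (here, `player`) means it is relatively more common among males.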


In [18]:
def counts_to_probs(gender_words):
    """ Compute probability of each term according to the frequency
    in a gender. """
    total = sum(gender_words.values())
    return dict([(word, count / total)
                 for word, count in gender_words.items()])

male_probs = counts_to_probs(male_words)
female_probs = counts_to_probs(female_words)

print('p(w|male)')
pprint(sorted(male_probs.items(), key=lambda x: -x[1])[:10])

print('\np(w|female)')
pprint(sorted(female_probs.items(), key=lambda x: -x[1])[:10])

p(w|male)
[('i', 0.026845637583892617),
 ('and', 0.02348993288590604),
 ('the', 0.022651006711409395),
 ('a', 0.020134228187919462),
 ('is', 0.015939597315436243),
 ('for', 0.01174496644295302),
 ('of', 0.01174496644295302),
 ('t', 0.010906040268456376),
 ('to', 0.010906040268456376),
 ('my', 0.009228187919463088)]

p(w|female)
[('i', 0.026348039215686275),
 ('a', 0.025735294117647058),
 ('and', 0.01715686274509804),
 ('the', 0.015318627450980392),
 ('to', 0.013480392156862746),
 ('my', 0.011642156862745098),
 ('you', 0.00980392156862745),
 ('is', 0.00980392156862745),
 ('for', 0.007352941176470588),
 ('m', 0.007352941176470588)]


In [19]:
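def odds_ratios(male_probs, female_probs):
    # NOTE: the union (|) raises a KeyError for any word seen in only one class.
    # Change | to & (intersection) to run this cell before smoothing.
    return dict([(w, female_probs[w] / male_probs[w])
                 for w in
                 set(male_probs) | set(female_probs)])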

ors = odds_ratios(male_probs, female_probs)
print('words mentioned both in male and female users:\n')
print('10 words with largest ratio, mentioned more by female\n')
pprint(sorted(ors.items(), key=lambda x: -x[1])[:10])
print('\n10 words with smallest ratio, mentioned more by male\n')
pprint(sorted(ors.items(), key=lambda x: x[1])[:10])

KeyError: 'back'

In [20]:
print(len(male_probs))
print(len(female_probs))
#print(female_probs['rock'])
print('rock' in male_probs)
print('rock' in female_probs)

772
974
True
True


**How to deal with 0-probabilities?**

$$p(w|\hbox{male}) = \frac{\hbox{freq}(w, \hbox{male})}
{\sum_i \hbox{freq}(w_i, \hbox{male})} $$

$\hbox{freq}(w, \hbox{male}) = 0$

Do we really believe there is **0** probability of a male using this term?

(Recall over-fitting discussion.)
<br><br><br><br>

**Additive Smoothing**

Reserve a small count (e.g., 1) for each unseen observation.

E.g., assume we've seen each word at least once in each class.

$$p(w|\hbox{male}) = \frac{1 + \hbox{freq}(w, \hbox{male})}
{|W| + \sum_i \hbox{freq}(w_i, \hbox{male})}$$

Notes:

If $w$ was never observed for males, then $\hbox{freq}(w, \hbox{male}) = 0$ and

$$p(w|\hbox{male}) = \frac{1}
{|W| + \sum_i \hbox{freq}(w_i, \hbox{male})}$$

If $w$ was observed, its count is still incremented by one:

$$p(w|\hbox{male}) = \frac{1 + \hbox{freq}(w, \hbox{male})}
{|W| + \sum_i \hbox{freq}(w_i, \hbox{male})}$$

$|W|$: number of unique words.
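
The smoothing formula can be sketched as a small function on toy counts. The helper `smoothed_probs` and its `alpha` parameter (the pseudo-count, 1 in the formula above) are assumptions made for this example, and the counts are invented.

```python
from collections import Counter

def smoothed_probs(counts, vocabulary, alpha=1):
    """p(w|class) = (alpha + freq(w)) / (alpha * |W| + sum_i freq(w_i))."""
    total = sum(counts[w] for w in vocabulary) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}

# Invented counts for illustration only.
toy_male_counts = Counter({'player': 3, 'the': 5})
toy_female_counts = Counter({'snapchat': 2, 'the': 4})
toy_vocab = set(toy_male_counts) | set(toy_female_counts)  # |W| = 3

toy_male_p = smoothed_probs(toy_male_counts, toy_vocab)
print(toy_male_p['snapchat'])  # (1 + 0) / (3 + 8) = 1/11
print(toy_male_p['the'])       # (1 + 5) / (3 + 8) = 6/11
```

Because every word in the shared vocabulary now has a non-zero probability in both classes, the ratio $p(w|\hbox{female}) / p(w|\hbox{male})$ is defined for all of them.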

In [21]:
# Additive smoothing. Add count of 1 for all words.
all_words = set(male_words) | set(female_words)
print(len(male_words.keys()))
male_words.update(all_words)
print(len(male_words.keys()))
female_words.update(all_words)

male_probs = counts_to_probs(male_words)
female_probs = counts_to_probs(female_words)
print('\n'.join(str(x) for x in 
                sorted(male_probs.items(), key=lambda x: -x[1])[:10]))
print(sum(male_probs.values()))

772
1549
('i', 0.012039401678219628)
('and', 0.010580080262677856)
('the', 0.010215249908792412)
('a', 0.009120758847136081)
('is', 0.0072966070777088655)
('for', 0.005472455308281649)
('of', 0.005472455308281649)
('t', 0.005107624954396206)
('to', 0.005107624954396206)
('my', 0.004377964246625319)
1.0000000000000164


In [22]:
#print([k for k,v in male_words.items() if v==1])
print(len(male_words)-len([v for v in male_words.values() if v==1]))

772


In [31]:
print(772/sum(male_words.values()))

0.2816490331995622


In [None]:
# Even if a word never appears in male profiles, it now has non-zero probability.
print(male_probs['rock'])

In [None]:
ors = odds_ratios(male_probs, female_probs)

sorted_ors = sorted(ors.items(), key=lambda x: -x[1])

print('Top Female Terms (OR):')
pprint(sorted_ors[:20])

print('\nTop Male Terms (OR):')
pprint(sorted_ors[-20:])