# CS579: Lecture 12  

**Demographic Inference I**

*[Dr. Aron Culotta](http://cs.iit.edu/~culotta)*  
*[Illinois Institute of Technology](http://iit.edu)*

# Midterm review 

~5 True/False, ~6 short answer
 
**Topics:** 
- Twitter API 
 - what comes in a tweet? 
 - how do rate limits work? 
 - can you understand API documentation? 
- graph basics: 
 - directed/undirected 
 - path 
 - cycle 
 - connected 
 - connected component 
 - degree (distribution) 
 - diameter 
 - average path length 
 - clustering coefficient 
- modeling networks 
 - random graphs 
 - regular graphs 
 - rewired graphs 
 - what makes a small world? 
- community detection 
 - girvan-newman (betweenness) 
 - graph cuts 
 - representing graphs with matrices 
 - graph laplacian 
- link prediction 
 - shortest path 
 - common neighbors 
 - jaccard 
 - preferential attachment 
 - sim rank 
 - evaluation 
- information cascades 
 - urn experiment 
 - bayes' theorem for decision making 
 - game-theoretic model 
 - maximizing payoff 
 - cluster density 
- sentiment analysis 
 - lexicon approach 
 - machine learning 
 
**Question types:** 
- What does this algorithm output? 
 - E.g., what is jaccard score for a specific link? 
 - E.g., what is the next step in girvan-newman? 
- What does this code do? 
 - E.g., I give you a new graph-generating algorithm, tell me what it produces 
- Write a new algorithm 
 - E.g., provide pseudo-code for the linear-threshold cascade model 
- True/False 
 - E.g., small world graphs have higher clustering coefficients than random graphs.

**dem·o·graph·ics**

statistical data relating to the population and particular groups within it.

E.g., age, ethnicity, gender, income, ...

# Why Demographics?

- Marketing
  - Who are my customers?
  - Who are my competitors' customers?
  - E.g., [DemographicsPro](http://www.demographicspro.com/samples#c=%40FamilyGuyonFOX)
  
- Social Media as Surveys
  - E.g., 45% of tweets express positive sentiment toward Pres. Obama
  - Who wrote those tweets?
  
- Health
  - 2% of Facebook users are expressing flu-like symptoms
  - Are they representative of the full population?



**User profiles vary from site to site.**

![rahm](rahm.png)

![rahm-fb](rahm-fb.png)

![rahm-li](rahm-li.png)

# Approaches

- Clever use of external data
  - E.g., U.S. Census name lists for gender
- Look for keywords in profile
  - "African American Male"
  - "Happy 21st birthday to me"
- Machine Learning
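
The keyword approach can be sketched with a regular expression over profile or tweet text. The pattern and the helper name `infer_age_from_text` below are illustrative assumptions, not a vetted rule set; a real system would likely need many more patterns and a check for false positives.

```python
import re

# Hypothetical pattern: "happy 21st birthday to me" suggests the author is 21.
BIRTHDAY_RE = re.compile(
    r'\bhappy (\d{1,2})(?:st|nd|rd|th) birthday to me\b', re.IGNORECASE)

def infer_age_from_text(text):
    """Return the age stated in a self-birthday message, or None."""
    match = BIRTHDAY_RE.search(text or '')
    return int(match.group(1)) if match else None

print(infer_age_from_text('Happy 21st birthday to me!'))   # 21
print(infer_age_from_text('Happy birthday to my sister'))  # None
```

Requiring "to me" is a deliberate choice here: it trades coverage for precision, since most birthday messages are addressed to someone else.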

In [52]:
# Guessing gender
import ConfigParser
import sys
import time
from TwitterAPI import TwitterAPI

def get_twitter(config_file):
    """ Read the config_file and construct an instance of TwitterAPI.
    Args:
      config_file ... A config file in ConfigParser format with Twitter credentials
    Returns:
      An instance of TwitterAPI.
    """
    config = ConfigParser.ConfigParser()
    config.read(config_file)
    twitter = TwitterAPI(
                   config.get('twitter', 'consumer_key'),
                   config.get('twitter', 'consumer_secret'),
                   config.get('twitter', 'access_token'),
                   config.get('twitter', 'access_token_secret'))
    return twitter


def robust_request(twitter, resource, params, max_tries=5):
    """ If a Twitter request fails, sleep for 15 minutes.
    Do this at most max_tries times before quitting.
    Args:
      twitter .... A TwitterAPI object.
      resource ... A resource string to request.
      params ..... A parameter dictionary for the request.
      max_tries .. The maximum number of tries to attempt.
    Returns:
      A TwitterResponse object, or None if failed.
    """
    for i in range(max_tries):
        request = twitter.request(resource, params)
        if request.status_code == 200:
            return request
        else:
            print >> sys.stderr, 'Got error:', request.text, '\nsleeping for 15 minutes.'
            sys.stderr.flush()
            time.sleep(60 * 15)

twitter = get_twitter('twitter.cfg')
request = robust_request(twitter, 'search/tweets',
                                  {'q': 'DonaldTrump',
                                  'count': '100'})
tweets = [t for t in request]
print 'fetched %d tweets' % len(tweets)

fetched 100 tweets
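
The retry loop in `robust_request` can be tried out without touching the Twitter API. The sketch below is a generic variant, not the notebook's function: `retry_request`, the `fetch` callable, the injectable `sleep` parameter, and `FakeResponse` are all hypothetical names introduced for illustration.

```python
import time

def retry_request(fetch, max_tries=5, wait_seconds=60 * 15, sleep=time.sleep):
    """Call fetch() until it returns a response with status_code 200.

    fetch ... hypothetical zero-argument callable returning an object
              with a status_code attribute.
    Returns the successful response, or None after max_tries failures.
    """
    for attempt in range(max_tries):
        response = fetch()
        if response.status_code == 200:
            return response
        if attempt < max_tries - 1:
            sleep(wait_seconds)  # no need to wait after the final failure
    return None

class FakeResponse(object):
    """Stand-in for an API response; only status_code is modeled."""
    def __init__(self, status_code):
        self.status_code = status_code

# Simulate two rate-limit errors followed by a success, recording the
# requested waits instead of actually sleeping.
codes = iter([429, 429, 200])
waits = []
result = retry_request(lambda: FakeResponse(next(codes)), sleep=waits.append)
print('%d after %d waits' % (result.status_code, len(waits)))  # 200 after 2 waits
```

Passing `sleep` as a parameter is only a testing convenience; it lets the loop run instantly while still showing how long a real run would have waited.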


In [53]:
# Print 10 names.
names = [t['user']['name'] for t in tweets]
print '\n'.join(names[:10])

Amanda Davis
Star Sievert  
Sheilah Morfin
Aurore Nowlin  
Justin Baragona
Jacob Williams
Darth Shwa
Contemptor
Lisette Pond
Trang Younkin


In [54]:
# Fetch census name data from:
# http://www.census.gov/genealogy/www/data/1990surnames/index.html

import requests
males_url = 'http://www2.census.gov/topics/genealogy/' + \
            '1990surnames/dist.male.first'
females_url = 'http://www2.census.gov/topics/genealogy/' + \
              '1990surnames/dist.female.first'
males = requests.get(males_url).text.split('\n')
females = requests.get(females_url).text.split('\n')
print 'males:\n', '\n'.join(males[:10])
print '\nfemales:\n', '\n'.join(females[:10])

males:
JAMES          3.318  3.318      1
JOHN           3.271  6.589      2
ROBERT         3.143  9.732      3
MICHAEL        2.629 12.361      4
WILLIAM        2.451 14.812      5
DAVID          2.363 17.176      6
RICHARD        1.703 18.878      7
CHARLES        1.523 20.401      8
JOSEPH         1.404 21.805      9
THOMAS         1.380 23.185     10

females:
MARY           2.629  2.629      1
PATRICIA       1.073  3.702      2
LINDA          1.035  4.736      3
BARBARA        0.980  5.716      4
ELIZABETH      0.937  6.653      5
JENNIFER       0.932  7.586      6
MARIA          0.828  8.414      7
SUSAN          0.794  9.209      8
MARGARET       0.768  9.976      9
DOROTHY        0.727 10.703     10


In [55]:
# Get names. 
male_names = set([m.split()[0].lower() for m in males if m])
female_names = set([f.split()[0].lower() for f in females if f])
print('%d male and %d female names' % (len(male_names), len(female_names)))
print 'males:\n', '\n'.join(list(male_names)[:10])
print '\nfemales:\n', '\n'.join(list(female_names)[:10])

1219 male and 4275 female names
males:
trenton
darrin
emile
jason
ron
ali
rob
rod
monte
steve

females:
fawn
kymberly
augustina
evalyn
augustine
chieko
linsey
hermina
shenika
sonja


In [56]:
def gender_by_name(tweets, male_names, female_names):
    for t in tweets:
        name = t['user']['name']
        if name:
            first = name.split()[0].lower()
            if first in male_names:
                t['gender'] = 'male'
            elif first in female_names:
                t['gender'] = 'female'
            else:
                t['gender'] = 'unknown'

gender_by_name(tweets, male_names, female_names)

In [57]:
from collections import Counter

def print_genders(tweets):
    print 'gender counts:\n', Counter([t['gender'] for t in tweets])
    for t in tweets[:20]:
        print t['gender'], t['user']['name']
    
print_genders(tweets)

gender counts:
Counter({'female': 37, 'male': 32, 'unknown': 31})
female Amanda Davis
female Star Sievert  
female Sheilah Morfin
female Aurore Nowlin  
male Justin Baragona
male Jacob Williams
unknown Darth Shwa
unknown Contemptor
female Lisette Pond
female Trang Younkin
female Jessica White
unknown Rykier
male Victor Stewart
female esther
male Walter Branum
male Andrew
male Tracy
female Juliann Jack  
female Tonya Russell
female  Betty Williams


In [58]:
# What about ambiguous names?

def print_ambiguous_names(male_names, female_names):
    ambiguous = [n for n in male_names if n in female_names]
    print 'found %d ambiguous names:\n'% len(ambiguous)
    print '\n'.join(ambiguous[:20])
    
print_ambiguous_names(male_names, female_names)

found 331 ambiguous names:

jason
ali
roy
marion
cameron
sung
cody
jessie
paris
demetrius
young
aaron
edward
daryl
billie
jack
andre
louis
joel
michael


In [59]:
# Keep names that are more frequent in one gender than the other.
males_pct = dict([(m.split()[0].lower(), float(m.split()[1]))
                  for m in males if m])
females_pct = dict([(f.split()[0].lower(), float(f.split()[1]))
                    for f in females if f])

male_names = set([m for m in male_names if m not in female_names or
              males_pct[m] > females_pct[m]])
female_names = set([f for f in female_names if f not in male_names or
              females_pct[f] > males_pct[f]])

print_ambiguous_names(male_names, female_names)
print('%d male and %d female names' % (len(male_names), len(female_names)))

found 0 ambiguous names:


1146 male and 4017 female names
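
Note that the cell above filters `female_names` against the already-filtered `male_names`, so the result depends on the order of the two assignments (a name with equal frequency in both lists ends up female). A minimal order-independent sketch follows; `split_ambiguous` is a hypothetical helper, the percentages are made up rather than taken from the census files, and dropping ties from both sets is an assumption of this sketch that differs from the cell above.

```python
def split_ambiguous(males_pct, females_pct):
    """Assign each name to the gender in which it is more frequent.

    Names with equal frequency in both lists are dropped from both sets
    (an assumption of this sketch).
    """
    male_names, female_names = set(), set()
    for name in set(males_pct) | set(females_pct):
        m = males_pct.get(name, 0.0)
        f = females_pct.get(name, 0.0)
        if m > f:
            male_names.add(name)
        elif f > m:
            female_names.add(name)
    return male_names, female_names

# Toy percentages, made up for illustration (not census values).
toy_males = {'james': 3.3, 'tracy': 0.04, 'jordan': 0.1}
toy_females = {'mary': 2.6, 'tracy': 0.2, 'jordan': 0.1}
toy_male_names, toy_female_names = split_ambiguous(toy_males, toy_females)
print(sorted(toy_male_names))    # ['james']
print(sorted(toy_female_names))  # ['mary', 'tracy']
```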


In [61]:
gender_by_name(tweets, male_names, female_names)
print_genders(tweets)

gender counts:
Counter({'female': 40, 'unknown': 31, 'male': 29})
female Amanda Davis
female Star Sievert  
female Sheilah Morfin
female Aurore Nowlin  
male Justin Baragona
male Jacob Williams
unknown Darth Shwa
unknown Contemptor
female Lisette Pond
female Trang Younkin
female Jessica White
unknown Rykier
male Victor Stewart
female esther
male Walter Branum
male Andrew
female Tracy
female Juliann Jack  
female Tonya Russell
female  Betty Williams


In [62]:
# Who are the unknowns?
# "Filtered" data can have a big impact on analysis.

unknown_names = Counter(t['user']['name']
                        for t in tweets if t['gender'] == 'unknown')
print '\n'.join(str(x) for x in unknown_names.most_common(20))

(u'The Sedrious God', 2)
(u'BuffaloInABox.com', 1)
(u'Rykier', 1)
(u'Sophia92Brown', 1)
(u'JJ', 1)
(u'Akasha ~ Alien ~ K11', 1)
(u'Nazario Esquivel', 1)
(u'Contemptor', 1)
(u'Hood Classics', 1)
(u'Mr.Migazzz', 1)
(u"Jason's Grandpa", 1)
(u'K.I. Haaland', 1)
(u'MrsP', 1)
(u'Black Lives Mattered', 1)
(u'redpilltwiceaday', 1)
(u'TxSaya', 1)
(u'la vida', 1)
(u'Lary Garecki', 1)
(u'#GodCountryfamily ', 1)
(u'AbigailAdamsBrigade', 1)


In [78]:
# How do the profiles of male Twitter users differ from
# those of female users?

male_profiles = [t['user']['description'] for t in tweets
                if t['gender'] == 'male']

female_profiles = [t['user']['description'] for t in tweets
                if t['gender'] == 'female']
#male_profiles = [t['text'] for t in tweets
#                if t['gender'] == 'male']

#female_profiles = [t['text'] for t in tweets
#                if t['gender'] == 'female']

import re
def tokenize(s):
    # Raw string so that \W reaches the regex engine unchanged.
    return re.sub(r'\W+', ' ', s).lower().split()

male_words = Counter()
female_words = Counter()

for p in male_profiles:
    male_words.update(Counter(tokenize(p)))
                      
for p in female_profiles:
    female_words.update(Counter(tokenize(p)))

print 'Most Common Male Terms:\n', \
    '\n'.join(str(x) for x in male_words.most_common(10))
    
print '\nMost Common Female Terms:\n', \
    '\n'.join(str(x) for x in female_words.most_common(10))

Most Common Male Terms:
(u'of', 11)
(u'husband', 5)
(u'bonniebo40', 5)
(u'to', 4)
(u'porque', 4)
(u't', 4)
(u'alcohol', 4)
(u'i', 4)
(u'artist', 3)
(u'christian', 3)

Most Common Female Terms:
(u'of', 14)
(u'and', 10)
(u'lover', 9)
(u'for', 5)
(u'coach', 4)
(u'fan', 4)
(u'music', 4)
(u'love', 4)
(u'junkie', 4)
(u'earth', 3)


In [79]:
print len(male_words)
print len(female_words)

233
301


In [80]:
# Compute difference
diff_counts = dict([(w, female_words[w] - male_words[w])
                    for w in
                    set(female_words.keys()) | set(male_words.keys())])

sorted_diffs = sorted(diff_counts.items(), key=lambda x: x[1])

print 'Top Male Terms (diff):\n', \
    '\n'.join(str(x) for x in sorted_diffs[:10])

print '\nTop Female Terms (diff):\n', \
    '\n'.join(str(x) for x in sorted_diffs[-10:])

Top Male Terms (diff):
(u'bonniebo40', -5)
(u'husband', -5)
(u'porque', -4)
(u't', -4)
(u'alcohol', -4)
(u'co', -3)
(u'pilot', -3)
(u'http', -3)
(u'no', -3)
(u'u', -3)

Top Female Terms (diff):
(u'in', 3)
(u'citizens', 3)
(u'am', 3)
(u'coach', 4)
(u'fan', 4)
(u'love', 4)
(u'junkie', 4)
(u'for', 5)
(u'lover', 7)
(u'and', 9)


**A problem with difference of counts:**

What if we have more male than female words in total?

*Solution:* **Odds Ratio (OR)**

$$ OR(w) = \frac{p(w|\hbox{female})}{p(w|\hbox{male})} $$

$$p(w|\hbox{female}) = \frac{\hbox{freq}(w, \hbox{female})}
{\sum_i \hbox{freq}(w_i, \hbox{female})} $$
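
To see why raw count differences can mislead, here is a small worked example with made-up counts in which the male corpus is ten times larger than the female one. The names `toy_male`, `toy_female`, and `prob` are assumptions for this sketch only.

```python
from collections import Counter

# Made-up counts: the male corpus is ten times larger than the female one.
toy_male = Counter({'coffee': 50, 'pilot': 50})   # 100 tokens in total
toy_female = Counter({'coffee': 8, 'coach': 2})   # 10 tokens in total

def prob(word, counts):
    """p(word | class): the word's share of all tokens in that class."""
    return counts[word] / float(sum(counts.values()))

diff = toy_female['coffee'] - toy_male['coffee']
ratio = prob('coffee', toy_female) / prob('coffee', toy_male)
print(diff)   # -42: the difference of counts suggests a "male" term
print(ratio)  # 1.6: relative to corpus size, it is more frequent among females
```

The two measures disagree about the direction of the association because the difference of counts is dominated by corpus size, while the ratio normalizes each count by its class total.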

In [81]:
def counts_to_probs(gender_words):
    """ Compute probability of each term according to the frequency
    in a gender. """
    total = 1. * sum(gender_words.values())
    return dict([(word, count / total)
                 for word, count in gender_words.items()])

male_probs = counts_to_probs(male_words)
female_probs = counts_to_probs(female_words)
print sorted(male_probs.items(), key=lambda x: -x[1])[:10]

[(u'of', 0.035483870967741936), (u'husband', 0.016129032258064516), (u'bonniebo40', 0.016129032258064516), (u'to', 0.012903225806451613), (u'porque', 0.012903225806451613), (u't', 0.012903225806451613), (u'alcohol', 0.012903225806451613), (u'i', 0.012903225806451613), (u'u', 0.00967741935483871), (u'artist', 0.00967741935483871)]


In [82]:
def odds_ratios(male_probs, female_probs):
    return dict([(w, female_probs[w] / male_probs[w])
                 for w in set(male_probs) | set(female_probs)])

ors = odds_ratios(male_probs, female_probs)

KeyError: u'all'

In [83]:
print len(male_probs)
print len(female_probs)

233
301


**How to deal with 0-probabilities?**

$$p(w|\hbox{male}) = \frac{\hbox{freq}(w, \hbox{male})}
{\sum_i \hbox{freq}(w_i, \hbox{male})} $$

$\hbox{freq}(w, \hbox{male}) = 0$

Do we really believe there is **0** probability of a male using this term?


**Additive Smoothing**

Reserve a small number of counts (e.g., 1) for unseen observations.

E.g., assume we've seen each word at least once in each class.

$$p(w|\hbox{male}) = \frac{1 + \hbox{freq}(w, \hbox{male})}
{|W| + \sum_i \hbox{freq}(w_i, \hbox{male})} $$

$|W|$: number of unique words.
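
The formula can be written as a small function that leaves the original counters untouched. This is a sketch under stated assumptions: `smoothed_probs` and the `alpha` parameter are hypothetical (with `alpha=1` it matches the formula above), the counts are made up, and every counted word is assumed to be in `vocabulary`.

```python
from collections import Counter

def smoothed_probs(counts, vocabulary, alpha=1.0):
    """Return p(w | class) for every w in vocabulary with additive smoothing.

    alpha is the pseudo-count added to each word; a Counter returns 0 for
    words it has never seen, so unseen words get alpha / total.
    """
    total = sum(counts[w] for w in vocabulary) + alpha * len(vocabulary)
    return dict((w, (counts[w] + alpha) / total) for w in vocabulary)

# Made-up counts for illustration.
toy_male = Counter({'pilot': 3, 'of': 5})
toy_female = Counter({'coach': 4, 'of': 4})
vocab = set(toy_male) | set(toy_female)   # |W| = 3

p_male = smoothed_probs(toy_male, vocab)
p_female = smoothed_probs(toy_female, vocab)
print(p_male['coach'])                      # (0 + 1) / (8 + 3) = 1/11, nonzero
print(p_female['coach'] / p_male['coach'])  # ratio is now defined (about 5)
```

Because every word in the shared vocabulary receives a nonzero probability in both classes, the ratio can be computed for any word without a `KeyError` or a division by zero.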

In [84]:
# Additive smoothing. Add count of 1 for all words.
all_words = set(male_words) | set(female_words)
male_words.update(all_words)  
female_words.update(all_words)

male_probs = counts_to_probs(male_words)
female_probs = counts_to_probs(female_words)
print '\n'.join(str(x) for x in 
                sorted(male_probs.items(), key=lambda x: -x[1])[:10])

(u'of', 0.015503875968992248)
(u'bonniebo40', 0.007751937984496124)
(u'husband', 0.007751937984496124)
(u'to', 0.006459948320413436)
(u'alcohol', 0.006459948320413436)
(u'porque', 0.006459948320413436)
(u'i', 0.006459948320413436)
(u't', 0.006459948320413436)
(u'christian', 0.00516795865633075)
(u'guru', 0.00516795865633075)


In [85]:
ors = odds_ratios(male_probs, female_probs)

sorted_ors = sorted(ors.items(), key=lambda x: -x[1])

print 'Top Female Terms (OR):\n', \
    '\n'.join(str(x) for x in sorted_ors[:20])

print '\nTop Male Terms (OR):\n', \
    '\n'.join(str(x) for x in sorted_ors[-20:])

Top Female Terms (OR):
(u'for', 5.086527929901424)
(u'and', 4.662650602409639)
(u'coach', 4.238773274917853)
(u'fan', 4.238773274917853)
(u'love', 4.238773274917853)
(u'junkie', 4.238773274917853)
(u'drinking', 3.3910186199342824)
(u'family', 3.3910186199342824)
(u'god', 3.3910186199342824)
(u'earth', 3.3910186199342824)
(u'grandmother', 3.3910186199342824)
(u'many', 3.3910186199342824)
(u'in', 3.3910186199342824)
(u'citizens', 3.3910186199342824)
(u'am', 3.3910186199342824)
(u'lover', 2.8258488499452357)
(u'hats', 2.543263964950712)
(u'spiritual', 2.543263964950712)
(u'healer', 2.543263964950712)
(u'ever', 2.543263964950712)

Top Male Terms (OR):
(u'age', 0.4238773274917853)
(u'all', 0.28258488499452356)
(u'siempre', 0.28258488499452356)
(u'es', 0.28258488499452356)
(u'hack', 0.28258488499452356)
(u'are', 0.28258488499452356)
(u'photos', 0.28258488499452356)
(u'it', 0.28258488499452356)
(u'or', 0.28258488499452356)
(u'editor', 0.28258488499452356)
(u'co', 0.21193866374589265)
(u'pilot